Molecule detection system on a solid support

ABSTRACT

Disclosed herein are compositions and methods used for detecting different types of molecules associated with a site on a solid support.

RELATED APPLICATIONS

This application is a continuation of U.S. Ser. No. 13/919,910, entitled “MOLECULE DETECTION SYSTEM ON A SOLID SUPPORT”, filed Jun. 17, 2013, which is a divisional of U.S. Ser. No. 12/885,106, entitled “VARIATION ANALYSIS FOR MULTIPLE TEMPLATES ON A SOLID SUPPORT”, filed Sep. 17, 2010, now U.S. Pat. No. 8,483,969 issued Jul. 9, 2013, the disclosures of which are incorporated herein by reference in their entireties.

TECHNOLOGICAL FIELD

The present technology relates to the fields of biological and molecular sciences. More particularly, the present technology relates to compositions and methods for analyzing molecules, such as nucleic acids, associated with a solid support.

BACKGROUND

Molecular sequencing, such as nucleic acid sequencing, has been used in a wide range of biological applications. For example, analysis of nucleic acid sequences has been used for identifying and classifying microorganisms, diagnosing infectious diseases, detecting and characterizing genetic abnormalities, identifying genetic changes associated with cancer, studying genetic susceptibility to disease, and measuring response to various types of treatment.

Nucleic acid sequencing methodology has evolved significantly in recent decades. Today, many sequencing methodologies require extensive sample processing prior to performing a sequencing run. As such, there is a need for methods, systems and compositions that simplify portions of the sequencing process.

SUMMARY

The present disclosure relates to methods, systems and compositions for detecting molecules. In particular, methods, systems and compositions for detecting multiple types of molecules on a solid support are described.

Some embodiments relate to methods for detecting molecules. In some embodiments, such methods comprise the steps of (a) providing a solid support comprising molecules associated with a site on the solid support such that the molecules are detected in aggregate during a detection step, wherein the site comprises at least two different types of molecules; (b) detecting a signal corresponding to the aggregate of molecules at the site; (c) estimating the fraction of different types of molecules at the site or estimating the amount of signal corresponding to different types of molecules at the site; (d) calculating the amount of signal corresponding to different types of molecules at the site using the fraction estimate, thereby obtaining a signal estimate, or calculating the fraction of different types of molecules at the site using the signal estimate, thereby obtaining a fraction estimate; and (e) iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby detecting molecules associated with the site.
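By way of non-limiting illustration only, steps (c) through (e) can be sketched for a site carrying two types of molecules. The sketch assumes an idealized model that is not taken from this disclosure: at each detection cycle, each type contributes a signal of 0 or 1, and the aggregate signal is the fraction-weighted sum of the two. The function names (`estimate_signals`, `estimate_fraction`, `detect`) and the least-squares fraction update are hypothetical choices made for the example.

```python
from itertools import product


def estimate_signals(aggregate, fraction):
    """Given a fraction estimate, pick for each cycle the pair of per-type
    signals (each 0 or 1) whose weighted sum is closest to the observation."""
    first, second = [], []
    for y in aggregate:
        best = min(
            product((0, 1), repeat=2),
            key=lambda s: abs(y - (fraction * s[0] + (1 - fraction) * s[1])),
        )
        first.append(best[0])
        second.append(best[1])
    return first, second


def estimate_fraction(aggregate, first, second, fallback):
    """Given signal estimates, compute the least-squares fraction of the
    first type. Only cycles where the two types differ carry information."""
    num = sum((y - b) * (a - b) for y, a, b in zip(aggregate, first, second))
    den = sum((a - b) ** 2 for a, b in zip(first, second))
    if den == 0:
        return fallback  # the two types are indistinguishable in this data
    return min(1.0, max(0.0, num / den))


def detect(aggregate, initial_fraction, tol=1e-9, max_iter=50):
    """Alternate the two estimates until the fraction stops changing."""
    fraction = initial_fraction
    first, second = estimate_signals(aggregate, fraction)
    for _ in range(max_iter):
        updated = estimate_fraction(aggregate, first, second, fraction)
        converged = abs(updated - fraction) < tol
        fraction = updated
        first, second = estimate_signals(aggregate, fraction)
        if converged:
            break
    return fraction, first, second


# Hypothetical noisy aggregate signal over six cycles, initial guess 0.6.
observed = [0.98, 0.03, 0.72, 1.01, 0.29, 0.0]
fraction, type_one, type_two = detect(observed, initial_fraction=0.6)
```

In this toy model the two types are interchangeable, so an initial guess below 0.5 would be expected to converge to the mirrored solution in which the labels of the two types are swapped.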

In some embodiments of the above-described methods, the providing step further comprises providing a mixture of molecules to the solid support. In other embodiments, the providing step further comprises associating the molecules with the site. In still other embodiments, the providing step further comprises attaching the molecules at the site.

In some embodiments of the above-described methods, the estimating step is performed by guessing the fraction of different types of molecules at the site or guessing the amount of signal corresponding to different types of molecules at the site. In other embodiments, the estimating step comprises performing a principal component analysis (PCA).
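The manner in which PCA is applied is not specified above. As one hypothetical illustration, assume that several sites carry the same two types of molecules in different fractions, and that the aggregate signal at each site is the fraction-weighted sum of two per-type signal vectors. Under that assumption the mean-centered data vary along a single direction, the difference between the two per-type signals, and the first principal component recovers that direction up to sign and scale. The sketch below computes the first principal component by power iteration in plain Python; the function name and the toy data are assumptions made for the example.

```python
import math


def first_principal_component(rows, n_iter=100):
    """Return (component, scores): the leading principal direction of the
    mean-centered rows, found by power iteration, and each row's projection."""
    n_rows, n_cols = len(rows), len(rows[0])
    means = [sum(r[t] for r in rows) / n_rows for t in range(n_cols)]
    centered = [[r[t] - means[t] for t in range(n_cols)] for r in rows]

    def project(vec):
        return [sum(c[t] * vec[t] for t in range(n_cols)) for c in centered]

    v = [float(t + 1) for t in range(n_cols)]  # arbitrary starting vector
    for _ in range(n_iter):
        scores = project(v)
        # Multiply by the (unnormalized) covariance matrix: X^T (X v).
        w = [sum(scores[i] * centered[i][t] for i in range(n_rows))
             for t in range(n_cols)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break  # no variance in the data
        v = [x / norm for x in w]
    return v, project(v)


# Hypothetical per-type signals and per-site fractions of the first type.
signal_one = [1, 0, 1, 1, 0, 0]
signal_two = [1, 0, 0, 1, 1, 0]
fractions = [0.2, 0.5, 0.7, 0.9]
sites = [[f * a + (1 - f) * b for a, b in zip(signal_one, signal_two)]
         for f in fractions]

component, scores = first_principal_component(sites)
```

In this example the component is nonzero only at the cycles where the two types differ, and the site scores are ordered by fraction, which is the kind of information an initial estimate could draw on.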

In some embodiments of the above-described methods, the updating step comprises performing a numerical optimization algorithm. In some such methods, the numerical optimization algorithm is based on an iterative map search. In some such embodiments, the numerical optimization algorithm is based on Fienup's iteration map.

In some embodiments of the above-described methods, sequence data is obtained for one or more molecules. In some such methods, sequence data is obtained by a sequencing-by-synthesis process. In certain embodiments, the sequencing-by-synthesis process comprises a pyrosequencing process.
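To illustrate what detection in aggregate can mean for a pyrosequencing process, the following sketch computes an idealized, noise-free flowgram for each of two sequences and mixes them by fraction. It assumes that the signal in each nucleotide flow is proportional to the number of bases incorporated in that flow, and it treats each input string as the strand being synthesized. The function names, sequences, flow order and fractions are hypothetical.

```python
def flowgram(sequence, flow_order, n_flows):
    """Count the bases incorporated in each flow when nucleotides are
    supplied one at a time in a repeating flow order."""
    counts = []
    pos = 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        # A homopolymer run is incorporated entirely within one flow.
        while pos < len(sequence) and sequence[pos] == base:
            run += 1
            pos += 1
        counts.append(run)
    return counts


def aggregate_flowgram(sequences, fractions, flow_order, n_flows):
    """Fraction-weighted sum of the per-sequence flowgrams at one site."""
    grams = [flowgram(s, flow_order, n_flows) for s in sequences]
    return [sum(f * g[i] for f, g in zip(fractions, grams))
            for i in range(n_flows)]


# Two sequences that differ at a single position, mixed 70:30 at one site.
mixed = aggregate_flowgram(["TTACG", "TTCCG"], [0.7, 0.3], "TACG", 8)
```

Flows in which the two sequences incorporate different numbers of bases yield non-integer aggregate values, and those values carry the information about the fraction of each type.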

In some embodiments of the above-described methods, the solid support comprises a bead. In some other embodiments, the solid support comprises a flow-cell.

In some embodiments of the above-described methods, about 1,000 to about 10,000 molecules are associated with the site. In other embodiments, about 2,000 to about 8,000 molecules are associated with the site. In yet other embodiments, about 3,000 to about 6,000 molecules are associated with the site. In some embodiments, the molecules are attached at the sites.

In still other embodiments of the methods described herein, a site comprises about 2 to about 10¹¹ molecules, about 2 to about 10¹⁰ molecules, about 2 to about 10⁹ molecules, about 2 to about 10⁸ molecules, about 2 to about 10⁷ molecules, about 2 to about 10⁶ molecules, about 2 to about 10⁵ molecules or about 2 to about 10⁴ molecules. In other embodiments, a site comprises about 10 to about 10¹¹ molecules, about 10 to about 10¹⁰ molecules, about 10 to about 10⁹ molecules, about 10 to about 10⁸ molecules, about 10 to about 10⁷ molecules, about 10 to about 10⁶ molecules, about 10 to about 10⁵ molecules or about 10 to about 10⁴ molecules. In still other embodiments, the site comprises about 50 to about 10¹¹ molecules, about 50 to about 10¹⁰ molecules, about 50 to about 10⁹ molecules, about 50 to about 10⁸ molecules, about 50 to about 10⁷ molecules, about 50 to about 10⁶ molecules, about 50 to about 10⁵ molecules or about 50 to about 10⁴ molecules. In yet other embodiments, a site comprises about 100 to about 10¹¹ molecules, about 100 to about 10¹⁰ molecules, about 100 to about 10⁹ molecules, about 100 to about 10⁸ molecules, about 100 to about 10⁷ molecules, about 100 to about 10⁶ molecules, about 100 to about 10⁵ molecules or about 100 to about 10⁴ molecules. In any of the above-described embodiments of the methods described herein, the molecules present at a site can be detected in aggregate. In some embodiments, the molecules are associated with the site. In other embodiments, the molecules are attached at the site.

In a preferred embodiment, the initial number of molecules associated with a site ranges from about 10 to about 1000 molecules. In another preferred embodiment, about 10 to about 500 molecules are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 molecules are initially associated with the site.

In preferred embodiments of the above-described methods, the molecules comprise nucleic acids. In some such methods, the nucleic acids are attached at the site. In some embodiments, the nucleic acids comprise a first subpopulation of nucleic acids and a second subpopulation of nucleic acids, wherein the nucleic acids of the first subpopulation each have an identical target region and the nucleic acids of the second subpopulation each have an identical region that is a variant of the target region.

In some embodiments, the nucleotide sequence of the target region of the nucleic acids of the first subpopulation has at least 1 nucleotide that is different as compared to the nucleotide sequence of the variant of the target region of the nucleic acids of the second subpopulation. In some embodiments, the nucleotide sequence of the target region of the nucleic acids of the first subpopulation has at least 3 nucleotides that are different as compared to the nucleotide sequence of the variant of the target region of the nucleic acids of the second subpopulation.
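As a simple illustration of counting nucleotide differences between a target region and its variant, the following sketch compares two sequences position by position. It assumes the two regions have equal length and are already aligned, so it covers substitutions only; insertions and deletions would require an alignment step that is not shown. The function name and the sequences are hypothetical.

```python
def count_differences(target, variant):
    """Number of positions at which two equal-length sequences differ."""
    if len(target) != len(variant):
        raise ValueError("sequences must be aligned to equal length")
    return sum(1 for a, b in zip(target, variant) if a != b)


one_difference = count_differences("ACGTAC", "ACCTAC")
three_differences = count_differences("ACGTAC", "TCGAAG")
```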

In some embodiments, a nucleotide sequence difference between the target region in the nucleic acids of the first subpopulation and the variant of the target region in the nucleic acids of the second subpopulation comprises at least one difference selected from the group consisting of a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, and a single nucleotide polymorphism (SNP).

In some embodiments, the nucleic acids comprise alleles of a genetic locus from a polyploid organism. In some other embodiments, the nucleic acids comprise alternative splicing forms of a nucleic acid. In yet other embodiments, the nucleic acids comprise alleles of a genetic locus from a diploid organism.

Also described herein are molecule detection systems. The molecule detection systems can comprise a solid support comprising molecules associated with a site on the solid support such that the molecules are detected in aggregate, wherein the molecules comprise at least two different types of molecules, and a detector configured to detect the molecules associated with the site. In some embodiments, the molecules are attached at the site. In a preferred embodiment, the molecules comprise nucleic acids.

In some embodiments of the molecule detection systems described herein, a site comprises about 2 to about 10¹¹ molecules, about 2 to about 10¹⁰ molecules, about 2 to about 10⁹ molecules, about 2 to about 10⁸ molecules, about 2 to about 10⁷ molecules, about 2 to about 10⁶ molecules, about 2 to about 10⁵ molecules or about 2 to about 10⁴ molecules. In other embodiments, a site comprises about 10 to about 10¹¹ molecules, about 10 to about 10¹⁰ molecules, about 10 to about 10⁹ molecules, about 10 to about 10⁸ molecules, about 10 to about 10⁷ molecules, about 10 to about 10⁶ molecules, about 10 to about 10⁵ molecules or about 10 to about 10⁴ molecules. In still other embodiments, the site comprises about 50 to about 10¹¹ molecules, about 50 to about 10¹⁰ molecules, about 50 to about 10⁹ molecules, about 50 to about 10⁸ molecules, about 50 to about 10⁷ molecules, about 50 to about 10⁶ molecules, about 50 to about 10⁵ molecules or about 50 to about 10⁴ molecules. In yet other embodiments, a site comprises about 100 to about 10¹¹ molecules, about 100 to about 10¹⁰ molecules, about 100 to about 10⁹ molecules, about 100 to about 10⁸ molecules, about 100 to about 10⁷ molecules, about 100 to about 10⁶ molecules, about 100 to about 10⁵ molecules or about 100 to about 10⁴ molecules. In any of the above-described embodiments of the molecule detection systems described herein, the molecules present at a site can be detected in aggregate. In some embodiments, the molecules are associated with the site. In other embodiments, the molecules are attached at the site. In certain embodiments, the molecules comprise nucleic acids.

Some embodiments of the above-described molecule detection systems can further comprise a fluid handling system configured to apply fluid to the site. Other embodiments of the above-described molecule detection systems can further comprise a light source configured to provide an excitation beam to the site.

Some embodiments of the above-described molecule detection systems can further comprise a first data processing module configured to estimate the fraction of different types of molecules at the site or the amount of signal corresponding to different types of molecules at the site. In some embodiments, the first data processing module is also used for determining the variation associated with the estimate. In other embodiments, the determining step is performed using a separate data processing module.

In some embodiments of such systems, the systems can further comprise a second data processing module configured to calculate the amount of signal corresponding to different types of molecules at the site using the fraction estimate or to calculate the fraction of different types of molecules at the site using the signal estimate. In other embodiments of such systems, the systems can further comprise a third data processing module configured to iteratively update the fraction estimate and signal estimate.

In some embodiments described herein, a plurality of data processing functions can be included together in one or a few modules. In other embodiments, data processing functions can be included separately in separate modules. It will be appreciated that a first data processing module need not be separate from a second data processing module. In some embodiments, a first data processing module and a second data processing module are the same data processing module.

In some embodiments of the above-described molecule detection systems, the systems are configured to identify the nucleotide sequence of a target region of a nucleic acid.

In some embodiments of the above-described molecule detection systems, the site is a well. In some other embodiments of the above-described molecule detection systems, the site is a bead, the bead being present in a well of a multiwell substrate. In such systems, the well can further comprise beads having an enzyme attached thereto. In some embodiments, the enzyme comprises sulfurylase. In some embodiments, the enzyme comprises luciferase. In some embodiments, the enzyme comprises a separate sulfurylase enzyme and a separate luciferase enzyme. In some embodiments, the well further comprises beads having neither a nucleic acid nor an enzyme attached thereto.

In some embodiments of the above-described molecule detection systems, about 1,000 to about 10,000 molecules are associated with the site. In some other embodiments, about 2,000 to about 8,000 molecules are associated with the site. In yet other embodiments, about 3,000 to about 6,000 molecules are associated with the site. In a preferred embodiment, the molecules are attached at the site.

In a preferred embodiment, the initial number of molecules associated with a site ranges from about 10 to about 1000 molecules. In another preferred embodiment, about 10 to about 500 molecules are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 molecules are initially associated with the site.

In some embodiments of the above-described molecule detection systems, the solid support comprises a bead. In some other embodiments, the solid support comprises a flow-cell.

Also provided herein are methods of identifying a target region of a nucleic acid. The methods can comprise (a) associating a first subpopulation of nucleic acids with a site on a solid support, wherein nucleic acids of the first subpopulation comprise an identical target region; (b) associating a second subpopulation of nucleic acids with the site on the solid support, wherein nucleic acids of the second subpopulation comprise an identical target region that is a variant of the target region of the nucleic acids of the first subpopulation; (c) detecting a signal corresponding to one or more nucleotides of the target region of first subpopulation nucleic acids and one or more nucleotides of the variant of the target region of second subpopulation nucleic acids; (d) estimating the fraction of first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site or estimating the amount of signal corresponding to first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site; (e) calculating the amount of signal corresponding to first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site using the fraction estimate, or calculating the fraction of first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site using the signal estimate; and (f) iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby identifying a target region of a nucleic acid.

In some embodiments of the above-described methods, step (a) comprises attaching first subpopulation nucleic acids and second subpopulation nucleic acids to the solid support.

In some embodiments of the above-described methods, step (d) comprises performing a principal component analysis (PCA).

In some embodiments of the above-described methods, step (f) comprises performing a numerical optimization algorithm. In some such embodiments, the numerical optimization algorithm is based on iterative map search. In some other embodiments, the numerical optimization algorithm is based on Fienup's iteration map.

In some embodiments of the above-described methods, sequence data is obtained from both first and second subpopulation nucleic acids. In some such embodiments, sequence data is obtained by a sequencing-by-synthesis process. In some embodiments, the sequencing-by-synthesis process comprises a pyrosequencing process.

In some embodiments of the above-described methods, the solid support comprises a bead. In other embodiments of the above-described methods, the solid support comprises a flow-cell.

In some embodiments of the above-described methods, about 1,000 to about 10,000 nucleic acids are associated with the site. In other embodiments, about 2,000 to about 8,000 nucleic acids are associated with the site. In yet other embodiments, about 3,000 to about 6,000 nucleic acids are associated with the site. In some embodiments, the nucleic acids are attached at the sites.

In a preferred embodiment, the initial number of molecules associated with a site ranges from about 10 to about 1000 molecules. In another preferred embodiment, about 10 to about 500 molecules are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 molecules are initially associated with the site.

In still other embodiments of the methods of identifying a target region of a nucleic acid described herein, a site comprises about 2 to about 10¹¹ nucleic acids, about 2 to about 10¹⁰ nucleic acids, about 2 to about 10⁹ nucleic acids, about 2 to about 10⁸ nucleic acids, about 2 to about 10⁷ nucleic acids, about 2 to about 10⁶ nucleic acids, about 2 to about 10⁵ nucleic acids or about 2 to about 10⁴ nucleic acids. In other embodiments, a site comprises about 10 to about 10¹¹ nucleic acids, about 10 to about 10¹⁰ nucleic acids, about 10 to about 10⁹ nucleic acids, about 10 to about 10⁸ nucleic acids, about 10 to about 10⁷ nucleic acids, about 10 to about 10⁶ nucleic acids, about 10 to about 10⁵ nucleic acids or about 10 to about 10⁴ nucleic acids. In still other embodiments, the site comprises about 50 to about 10¹¹ nucleic acids, about 50 to about 10¹⁰ nucleic acids, about 50 to about 10⁹ nucleic acids, about 50 to about 10⁸ nucleic acids, about 50 to about 10⁷ nucleic acids, about 50 to about 10⁶ nucleic acids, about 50 to about 10⁵ nucleic acids or about 50 to about 10⁴ nucleic acids. In yet other embodiments, a site comprises about 100 to about 10¹¹ nucleic acids, about 100 to about 10¹⁰ nucleic acids, about 100 to about 10⁹ nucleic acids, about 100 to about 10⁸ nucleic acids, about 100 to about 10⁷ nucleic acids, about 100 to about 10⁶ nucleic acids, about 100 to about 10⁵ nucleic acids or about 100 to about 10⁴ nucleic acids. In any of the above-described embodiments of the methods described herein, the nucleic acids present at a site can be detected in aggregate. In some embodiments, the nucleic acids are associated with the site. In other embodiments, the nucleic acids are attached at the site.

In some embodiments of the above-described methods, the nucleotide sequence of the target region of first subpopulation nucleic acids has at least 1 nucleotide that is different as compared to the nucleotide sequence of the variant of the target region of second subpopulation nucleic acids. In some other embodiments, the nucleotide sequence of the target region of first subpopulation nucleic acids has at least 3 nucleotides that are different as compared to the nucleotide sequence of the variant of the target region of second subpopulation nucleic acids.

In some embodiments of the above-described methods, a nucleotide sequence difference between the target region in first subpopulation nucleic acids and the variant of the target region in second subpopulation nucleic acids comprises at least one difference selected from the group consisting of a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, and a single nucleotide polymorphism (SNP).

In some embodiments of the above-described methods, first subpopulation nucleic acids and second subpopulation nucleic acids comprise alleles of a genetic locus from a polyploid organism. In some other embodiments, first subpopulation nucleic acids and second subpopulation nucleic acids comprise alternative splicing forms of a nucleic acid. In yet other embodiments, first subpopulation nucleic acids and second subpopulation nucleic acids comprise alleles of a genetic locus from a diploid organism.

Also provided herein are methods for identifying a biosignature. The methods can comprise the steps of (a) providing samples obtained from a plurality of subjects, wherein the samples comprise molecules; (b) tagging molecules from the samples so as to identify the subject from which each sample originated; (c) associating molecules from the samples with a site on a solid support such that the molecules are detected in aggregate during a detection step, wherein the site comprises at least two different types of molecules; (d) obtaining a biosignature for molecules associated with the site by: i) detecting a signal corresponding to the aggregate of the molecules at the site, ii) estimating the fraction of different types of molecules at the site or the amount of signal corresponding to different types of molecules at the site, iii) calculating the amount of signal corresponding to different types of molecules at the site using the fraction estimate, or calculating the fraction of different types of molecules at the site using the signal estimate, and iv) iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby obtaining a biosignature for molecules at the site; and (e) comparing the biosignature obtained in step (d) to a reference biosignature, thereby identifying the biosignature. In a preferred embodiment, the molecules are attached at the site.
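The comparison of step (e) is not tied to any particular metric above. As a hypothetical illustration only, a biosignature can be represented as a mapping from molecule type to its estimated fraction at the site, and compared to each reference biosignature by half the sum of absolute differences (the total variation distance). The representation, the metric, the function names and the example values are all assumptions made for this sketch.

```python
def signature_distance(signature, reference):
    """Total variation distance between two fraction profiles. Types that
    are missing from one profile are treated as having fraction zero."""
    keys = set(signature) | set(reference)
    return 0.5 * sum(
        abs(signature.get(k, 0.0) - reference.get(k, 0.0)) for k in keys
    )


def closest_reference(signature, references):
    """Name of the reference biosignature nearest to the observed one."""
    return min(
        references,
        key=lambda name: signature_distance(signature, references[name]),
    )


# Hypothetical converged fraction estimates for one site, and two references.
observed = {"type_A": 0.7, "type_B": 0.3}
references = {
    "reference_1": {"type_A": 0.65, "type_B": 0.35},
    "reference_2": {"type_A": 0.2, "type_B": 0.8},
}
match = closest_reference(observed, references)
```

In practice a threshold on the distance would likely be needed so that a biosignature far from every reference is reported as unidentified rather than assigned to the nearest one.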

In some embodiments of the above-described methods, about 1,000 to about 10,000 molecules are associated with the site. In other embodiments, about 2,000 to about 8,000 molecules are associated with the site. In yet other embodiments, about 3,000 to about 6,000 molecules are associated with the site. In some embodiments, the molecules are attached at the sites.

In a preferred embodiment, the initial number of molecules associated with a site ranges from about 10 to about 1000 molecules. In another preferred embodiment, about 10 to about 500 molecules are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 molecules are initially associated with the site.

In still other embodiments of the methods described herein, a site comprises about 2 to about 10¹¹ molecules, about 2 to about 10¹⁰ molecules, about 2 to about 10⁹ molecules, about 2 to about 10⁸ molecules, about 2 to about 10⁷ molecules, about 2 to about 10⁶ molecules, about 2 to about 10⁵ molecules or about 2 to about 10⁴ molecules. In other embodiments, a site comprises about 10 to about 10¹¹ molecules, about 10 to about 10¹⁰ molecules, about 10 to about 10⁹ molecules, about 10 to about 10⁸ molecules, about 10 to about 10⁷ molecules, about 10 to about 10⁶ molecules, about 10 to about 10⁵ molecules or about 10 to about 10⁴ molecules. In still other embodiments, the site comprises about 50 to about 10¹¹ molecules, about 50 to about 10¹⁰ molecules, about 50 to about 10⁹ molecules, about 50 to about 10⁸ molecules, about 50 to about 10⁷ molecules, about 50 to about 10⁶ molecules, about 50 to about 10⁵ molecules or about 50 to about 10⁴ molecules. In yet other embodiments, a site comprises about 100 to about 10¹¹ molecules, about 100 to about 10¹⁰ molecules, about 100 to about 10⁹ molecules, about 100 to about 10⁸ molecules, about 100 to about 10⁷ molecules, about 100 to about 10⁶ molecules, about 100 to about 10⁵ molecules or about 100 to about 10⁴ molecules. In any of the above-described embodiments of the methods described herein, the molecules present at a site can be detected in aggregate. In some embodiments, the molecules are associated with the site. In other embodiments, the molecules are attached at the site.

In a preferred embodiment of the above-described methods, the molecules comprise nucleic acids. In some such embodiments, the nucleic acids comprise a marker from a pathogen. In certain embodiments, the pathogen comprises a pathogen selected from the group consisting of a virus, a bacterium and a eukaryotic cell. In some embodiments, the eukaryotic cell can be a cancer cell.

In some embodiments of the above-described methods, the sample comprises an abnormal cell type.

In some embodiments of the above-described methods, the sample comprises a mixture of eukaryotic cell types, a mixture of microorganisms or a mixture of both eukaryotic cell types and microorganisms.

In some embodiments of the above-described methods, the subject is a living material or organism. In other embodiments, the subject is not living material.

In some embodiments of the above-described methods, the sample comprises human flora. In some such embodiments, the human flora is selected from the group consisting of skin flora, nasal flora, gut flora, vaginal flora, and oral cavity flora.

In some embodiments of the above-described methods, the sample is obtained from a cancer patient.

Also described herein is a solid support including a population of nucleic acids associated with a site on the solid support such that nucleic acids of the population of nucleic acids are detected in aggregate, the population of nucleic acids comprising a first subpopulation and a second subpopulation, wherein nucleic acids of the first subpopulation comprise an identical target region and nucleic acids of the second subpopulation comprise an identical region that is a variant of the target region.

In embodiments of the above-described solid supports, about 1,000 to about 10,000 nucleic acids are associated with the site. In some other such embodiments, about 2,000 to about 8,000 nucleic acids are associated with the site. In yet other such embodiments, about 3,000 to about 6,000 nucleic acids are associated with the site. In some embodiments of the above-described solid support, the nucleic acids are attached at the site.

In a preferred embodiment, the initial number of molecules associated with a site ranges from about 10 to about 1000 molecules. In another preferred embodiment, about 10 to about 500 molecules are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 molecules are initially associated with the site.

In other embodiments of the above-described solid support, a site comprises about 2 to about 10¹¹ nucleic acids, about 2 to about 10¹⁰ nucleic acids, about 2 to about 10⁹ nucleic acids, about 2 to about 10⁸ nucleic acids, about 2 to about 10⁷ nucleic acids, about 2 to about 10⁶ nucleic acids, about 2 to about 10⁵ nucleic acids or about 2 to about 10⁴ nucleic acids. In other embodiments, a site comprises about 10 to about 10¹¹ nucleic acids, about 10 to about 10¹⁰ nucleic acids, about 10 to about 10⁹ nucleic acids, about 10 to about 10⁸ nucleic acids, about 10 to about 10⁷ nucleic acids, about 10 to about 10⁶ nucleic acids, about 10 to about 10⁵ nucleic acids or about 10 to about 10⁴ nucleic acids. In still other embodiments, the site comprises about 50 to about 10¹¹ nucleic acids, about 50 to about 10¹⁰ nucleic acids, about 50 to about 10⁹ nucleic acids, about 50 to about 10⁸ nucleic acids, about 50 to about 10⁷ nucleic acids, about 50 to about 10⁶ nucleic acids, about 50 to about 10⁵ nucleic acids or about 50 to about 10⁴ nucleic acids. In yet other embodiments, a site comprises about 100 to about 10¹¹ nucleic acids, about 100 to about 10¹⁰ nucleic acids, about 100 to about 10⁹ nucleic acids, about 100 to about 10⁸ nucleic acids, about 100 to about 10⁷ nucleic acids, about 100 to about 10⁶ nucleic acids, about 100 to about 10⁵ nucleic acids or about 100 to about 10⁴ nucleic acids. In any of the above-described embodiments of the solid supports described herein, the nucleic acids present at a site can be detected in aggregate. In some embodiments, the nucleic acids are associated with the site. In other embodiments, the nucleic acids are attached at the site.

In some embodiments of the above-described solid support, a nucleotide sequence difference between the target region in first subpopulation nucleic acids and the variant of the target region in second subpopulation nucleic acids comprises at least one difference selected from the group consisting of a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, and a single nucleotide polymorphism (SNP).

In some embodiments of the above-described solid support, the population of nucleic acids comprises alleles of a genetic locus from a polyploid organism. In other embodiments, the population of nucleic acids comprises alternative splicing forms of a nucleic acid. In yet other embodiments, the population of nucleic acids comprises alleles of a genetic locus from a diploid organism.

Also described herein are mixtures of beads comprising a plurality of beads. In some embodiments, each bead of the plurality of beads comprises a first subpopulation and a second subpopulation of nucleic acids, wherein the first subpopulation and the second subpopulation of nucleic acids are associated with the bead such that they are detected in aggregate. In a preferred embodiment, the nucleic acids of the first subpopulation each comprise an identical target region and the nucleic acids of the second subpopulation each comprise an identical region that is a variant of the target region.

In some embodiments of the above-described mixture of beads, the nucleic acids are attached to each bead of the plurality of beads.

In some embodiments, the mixture of beads is distributed on a substrate. In other embodiments, the plurality of beads having beads comprising both the first subpopulation and a second subpopulation of nucleic acids is distributed on a substrate. In some embodiments, the distribution of beads on the substrate is a random distribution. In other embodiments, the substrate comprises wells and the beads are distributed in the wells. In yet other embodiments, wells of the substrate further comprise beads having an enzyme attached thereto. In preferred embodiments, the enzyme can comprise sulfurylase, luciferase or a combination of sulfurylase and luciferase. In some embodiments, wells of the substrate further comprise beads having neither a nucleic acid nor an enzyme attached thereto.

Also provided herein are beads comprising a first subpopulation of capture nucleic acids having a competitor molecule hybridized thereto and a second subpopulation of capture nucleic acids comprising a region that permits hybridization of a complementary molecule.

Also provided herein are beads comprising capture nucleic acids hybridized with an amplified nucleic acid comprising a degenerate tag, the degenerate tag being hybridized to a capture nucleic acid. In some embodiments, the bead is present in a channel of a substrate. In other embodiments, the bead is present in a well of a multiwell substrate. In a preferred embodiment, the well is configured to hold a single bead having the amplified nucleic acids hybridized thereto.

In embodiments of the above-described beads or mixtures of beads, about 1,000 to about 10,000 nucleic acids are associated with the bead. In some other such embodiments, about 2,000 to about 8,000 nucleic acids are associated with the bead. In yet other such embodiments, about 3,000 to about 6,000 nucleic acids are associated with the bead. In some embodiments of the above-described beads or mixtures of beads, the nucleic acids are attached at the bead.

In a preferred embodiment, the initial number of molecules associated with a site ranges from about 10 to about 1000 molecules. In another preferred embodiment, about 10 to about 500 molecules are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 molecules are initially associated with the site.

In other embodiments of the above-described beads or mixtures of beads, a bead comprises about 2 to about 10¹¹ nucleic acids, about 2 to about 10¹⁰ nucleic acids, about 2 to about 10⁹ nucleic acids, about 2 to about 10⁸ nucleic acids, about 2 to about 10⁷ nucleic acids, about 2 to about 10⁶ nucleic acids, about 2 to about 10⁵ nucleic acids or about 2 to about 10⁴ nucleic acids. In other embodiments, a bead comprises about 10 to about 10¹¹ nucleic acids, about 10 to about 10¹⁰ nucleic acids, about 10 to about 10⁹ nucleic acids, about 10 to about 10⁸ nucleic acids, about 10 to about 10⁷ nucleic acids, about 10 to about 10⁶ nucleic acids, about 10 to about 10⁵ nucleic acids or about 10 to about 10⁴ nucleic acids. In still other embodiments, the bead comprises about 50 to about 10¹¹ nucleic acids, about 50 to about 10¹⁰ nucleic acids, about 50 to about 10⁹ nucleic acids, about 50 to about 10⁸ nucleic acids, about 50 to about 10⁷ nucleic acids, about 50 to about 10⁶ nucleic acids, about 50 to about 10⁵ nucleic acids or about 50 to about 10⁴ nucleic acids. In yet other embodiments, a bead comprises about 100 to about 10¹¹ nucleic acids, about 100 to about 10¹⁰ nucleic acids, about 100 to about 10⁹ nucleic acids, about 100 to about 10⁸ nucleic acids, about 100 to about 10⁷ nucleic acids, about 100 to about 10⁶ nucleic acids, about 100 to about 10⁵ nucleic acids or about 100 to about 10⁴ nucleic acids. In any of the above-described embodiments of the beads or mixtures of beads described herein, the nucleic acids present at a bead can be detected in aggregate. In some embodiments, the nucleic acids are associated with the bead. In other embodiments, the nucleic acids are attached at the bead.

Additional embodiments can be found as set forth in the numbered paragraphs below.

1. A method of detecting molecules, said method comprising the steps of (a) providing a solid support comprising molecules associated with a site on the solid support such that the molecules are detected in aggregate during a detection step, wherein the site comprises at least two different types of molecules; (b) detecting a signal corresponding to the aggregate of molecules at the site; (c) estimating the fraction of different types of molecules at the site or estimating the amount of signal corresponding to different types of molecules at the site; (d) calculating the amount of signal corresponding to different types of molecules at the site using the fraction estimate, thereby obtaining a signal estimate or calculating the fraction of different types of molecules at the site using the signal estimate, thereby obtaining a fraction estimate; and (e) iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby detecting molecules associated with the site.

2. The method of paragraph 1, wherein the providing step further comprises providing a mixture of molecules to said solid support.

3. The method of paragraph 1, wherein the providing step further comprises attaching the molecules at the site.

4. The method of paragraph 1, wherein the estimating step is performed by guessing the fraction of different types of molecules at the site or guessing the amount of signal corresponding to different types of molecules at the site.

5. The method of paragraph 1, wherein the providing step further comprises associating the molecules with the site.

6. The method of paragraph 1, wherein the estimating step comprises performing a principal component analysis (PCA).

7. The method of paragraph 1, wherein the updating step comprises performing a numerical optimization algorithm.

8. The method of paragraph 7, wherein the numerical optimization algorithm is based on an iterative map search.

9. The method of paragraph 8, wherein the numerical optimization algorithm is based on Fienup's iteration map.

10. The method of paragraph 1, wherein sequence data is obtained for one or more molecules.

11. The method of paragraph 10, wherein sequence data is obtained by a sequencing-by-synthesis process.

12. The method of paragraph 11, wherein the sequencing-by-synthesis process comprises a pyrosequencing process.

13. The method of paragraph 1, wherein the solid support comprises a bead.

14. The method of paragraph 1, wherein the solid support comprises a flow-cell.

15. The method of paragraph 1, wherein about 1,000 to about 10,000 molecules are associated with the site.

16. The method of paragraph 1, wherein about 2,000 to about 8,000 molecules are associated with the site.

17. The method of paragraph 1, wherein about 3,000 to about 6,000 molecules are associated with the site.

18. The method of paragraph 1, wherein the molecules comprise nucleic acids.

19. The method of paragraph 18, wherein said nucleic acids are attached at the site.

20. The method of paragraph 18, wherein the nucleic acids comprise a first subpopulation of nucleic acids and a second subpopulation of nucleic acids, said nucleic acids of the first subpopulation each having an identical target region and the nucleic acids of the second subpopulation each having an identical region that is a variant of the target region.

21. The method of paragraph 18, wherein the nucleotide sequence of said target region of the nucleic acids of the first subpopulation has at least 1 nucleotide that is different as compared to the nucleotide sequence of said variant of the target region of the nucleic acids of the second subpopulation.

22. The method of paragraph 18, wherein the nucleotide sequence of said target region of the nucleic acids of the first subpopulation has at least 3 nucleotides that are different as compared to the nucleotide sequence of said variant of the target region of the nucleic acids of the second subpopulation.

23. The method of paragraph 18, wherein a nucleotide sequence difference between the target region in the nucleic acids of the first subpopulation and the variant of the target region in the nucleic acids of the second subpopulation comprises at least one difference selected from the group consisting of a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, and a single nucleotide polymorphism (SNP).

24. The method of paragraph 18, wherein the nucleic acids comprise alleles of a genetic locus from a polyploid organism.

25. The method of paragraph 18, wherein the nucleic acids comprise alternative splicing forms of a nucleic acid.

26. The method of paragraph 18, wherein the nucleic acids comprise alleles of a genetic locus from a diploid organism.

27. A molecule detection system comprising a solid support comprising molecules associated with a site on the solid support such that the molecules are detected in aggregate, wherein the molecules comprise at least two different types of molecules; and a detector configured to detect said molecules associated with said site.

28. The system of paragraph 27, wherein the molecule detection system further comprises a fluid handling system configured to apply fluid to said site.

29. The system of paragraph 27, wherein the molecule detection system further comprises a light source configured to provide an excitation beam to said site.

30. The system of paragraph 27, wherein the molecule detection system further comprises a first data processing module configured to estimate the fraction of different types of molecules at the site or the amount of signal corresponding to different types of molecules at the site.

31. The system of paragraph 30, wherein the molecule detection system further comprises a second data processing module configured to calculate the amount of signal corresponding to different types of molecules at the site using the fraction estimate or to calculate the fraction of different types of molecules at the site using the signal estimate.

32. The system of paragraph 31, wherein the molecule detection system further comprises a third data processing module configured to iteratively update the fraction estimate and signal estimate.

33. The system of paragraph 27, wherein the molecule detection system is configured to identify the nucleotide sequence of a target region of a nucleic acid.

34. The system of paragraph 27, wherein the site is a well.

35. The system of paragraph 27, wherein the site is a bead, said bead being present in a well of a multiwell substrate.

36. The system of paragraph 35, wherein the well further comprises beads having an enzyme attached thereto.

37. The system of paragraph 36, wherein said enzyme comprises sulfurylase.

38. The system of paragraph 36, wherein said enzyme comprises luciferase.

39. The system of paragraph 36, wherein said enzyme comprises a separate sulfurylase enzyme and a separate luciferase enzyme.

40. The system of paragraph 35, wherein the well further comprises beads having neither a nucleic acid nor an enzyme attached thereto.

41. The system of paragraph 27, wherein the molecules are attached at the site.

42. The system of paragraph 27, wherein about 1,000 to about 10,000 molecules are associated with the site.

43. The system of paragraph 27, wherein about 2,000 to about 8,000 molecules are associated with the site.

44. The system of paragraph 27, wherein about 3,000 to about 6,000 molecules are associated with the site.

45. The system of paragraph 27, wherein the molecules comprise nucleic acids.

46. The system of paragraph 27, wherein the solid support comprises a bead.

47. The system of paragraph 27, wherein the solid support comprises a flow-cell.

48. A method of identifying a target region of a nucleic acid, said method comprising (a) associating a first subpopulation of nucleic acids with a site on a solid support, wherein nucleic acids of said first subpopulation comprise an identical target region; (b) associating a second subpopulation of nucleic acids with the site on the solid support, wherein nucleic acids of said second subpopulation comprise an identical target region that is a variant of the target region of the nucleic acids of said first subpopulation; (c) detecting a signal corresponding to one or more nucleotides of the target region of first subpopulation nucleic acids and one or more nucleotides of the variant of the target region of second subpopulation nucleic acids; (d) estimating the fraction of first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site or estimating the amount of signal corresponding to first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site; (e) calculating the amount of signal corresponding to first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site using the fraction estimate, or calculating the fraction of first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site using the signal estimate; and (f) iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby identifying a target region of a nucleic acid.

49. The method of paragraph 48, wherein step (a) comprises attaching first subpopulation nucleic acids and second subpopulation nucleic acids to the solid support.

50. The method of paragraph 48, wherein step (d) comprises performing a principal component analysis (PCA).

51. The method of paragraph 48, wherein step (f) comprises performing a numerical optimization algorithm.

52. The method of paragraph 51, wherein the numerical optimization algorithm is based on an iterative map search.

53. The method of paragraph 52, wherein the numerical optimization algorithm is based on Fienup's iteration map.

54. The method of paragraph 48, wherein sequence data is obtained from both first and second subpopulation nucleic acids.

55. The method of paragraph 54, wherein sequence data is obtained by a sequencing-by-synthesis process.

56. The method of paragraph 55, wherein the sequencing-by-synthesis process comprises a pyrosequencing process.

57. The method of paragraph 48, wherein the solid support comprises a bead.

58. The method of paragraph 48, wherein the solid support comprises a flow-cell.

59. The method of paragraph 48, wherein about 1,000 to about 10,000 nucleic acids are associated with the site.

60. The method of paragraph 48, wherein about 2,000 to about 8,000 nucleic acids are associated with the site.

61. The method of paragraph 48, wherein about 3,000 to about 6,000 nucleic acids are associated with the site.

62. The method of paragraph 48, wherein the nucleotide sequence of said target region of first subpopulation nucleic acids has at least 1 nucleotide that is different as compared to the nucleotide sequence of said variant of the target region of second subpopulation nucleic acids.

63. The method of paragraph 48, wherein the nucleotide sequence of said target region of first subpopulation nucleic acids has at least 3 nucleotides that are different as compared to the nucleotide sequence of said variant of the target region of second subpopulation nucleic acids.

64. The method of paragraph 48, wherein a nucleotide sequence difference between the target region in first subpopulation nucleic acids and the variant of the target region in second subpopulation nucleic acids comprises at least one difference selected from the group consisting of a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, and a single nucleotide polymorphism (SNP).

65. The method of paragraph 48, wherein first subpopulation nucleic acids and second subpopulation nucleic acids comprise alleles of a genetic locus from a polyploid organism.

66. The method of paragraph 48, wherein first subpopulation nucleic acids and second subpopulation nucleic acids comprise alternative splicing forms of a nucleic acid.

67. The method of paragraph 48, wherein first subpopulation nucleic acids and second subpopulation nucleic acids comprise alleles of a genetic locus from a diploid organism.

68. A method for identifying a biosignature, said method comprising the steps of (a) providing samples obtained from a plurality of subjects, wherein the samples comprise molecules; (b) tagging molecules from the samples so as to identify the subject from which each sample originated; (c) associating molecules from the samples with a site on a solid support such that the molecules are detected in aggregate during a detection step, wherein the site comprises at least two different types of molecules; (d) obtaining a biosignature for molecules associated with the site by i) detecting a signal corresponding to the aggregate of the molecules at the site; ii) estimating the fraction of different types of molecules at the site or the amount of signal corresponding to different types of molecules at the site; iii) calculating the amount of signal corresponding to different types of molecules at the site using the fraction estimate, or calculating the fraction of different types of molecules at the site using the signal estimate; and iv) iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby obtaining a biosignature for molecules at the site; and (e) comparing the biosignature obtained in step (d) to a reference biosignature, thereby identifying said biosignature.

69. The method of paragraph 68, wherein the molecules are attached at the site.

70. The method of paragraph 68, wherein the molecules comprise nucleic acids.

71. The method of paragraph 70, wherein said nucleic acids comprise a marker from a pathogen.

72. The method of paragraph 71, wherein the pathogen comprises a pathogen selected from the group consisting of a virus, a bacterium and a eukaryotic cell.

73. The method of paragraph 72, wherein the eukaryotic cell comprises a cancer cell.

74. The method of paragraph 68, wherein the sample comprises an abnormal cell type.

75. The method of paragraph 68, wherein the sample comprises a mixture of eukaryotic cell types, a mixture of microorganisms or a mixture of both eukaryotic cell types and microorganisms.

76. The method of paragraph 68, wherein the subject is not living material.

77. The method of paragraph 68, wherein the sample comprises human flora.

78. The method of paragraph 77, wherein the human flora are selected from the group consisting of skin flora, nasal flora, gut flora, vaginal flora, and oral cavity flora.

79. The method of paragraph 68, wherein the sample is obtained from a cancer patient.

80. A solid support comprising a population of nucleic acids associated with a site on said solid support such that nucleic acids of said population of nucleic acids are detected in aggregate, said population of nucleic acids comprising a first subpopulation and a second subpopulation, wherein nucleic acids of the first subpopulation comprise an identical target region and nucleic acids of the second subpopulation comprise an identical region that is a variant of the target region.

81. The solid support of paragraph 80, wherein said population of nucleic acids is associated with said site.

82. The solid support of paragraph 81, wherein about 1,000 to about 10,000 nucleic acids are associated with said site.

83. The solid support of paragraph 81, wherein about 2,000 to about 8,000 nucleic acids are associated with said site.

84. The solid support of paragraph 81, wherein about 3,000 to about 6,000 nucleic acids are associated with said site.

85. The solid support of paragraph 80, wherein a nucleotide sequence difference between the target region in first subpopulation nucleic acids and the variant of the target region in second subpopulation nucleic acids comprises at least one difference selected from the group consisting of a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, and a single nucleotide polymorphism (SNP).

86. The solid support of paragraph 80, wherein said population of nucleic acids comprises alleles of a genetic locus from a polyploid organism.

87. The solid support of paragraph 80, wherein said population of nucleic acids comprises alternative splicing forms of a nucleic acid.

88. The solid support of paragraph 80, wherein said population of nucleic acids comprises alleles of a genetic locus from a diploid organism.

89. A mixture of beads comprising a plurality of beads, each bead of said plurality of beads comprising a first subpopulation and a second subpopulation of nucleic acids, wherein said first subpopulation and said second subpopulation of nucleic acids are associated with the bead such that they are detected in aggregate, wherein said nucleic acids of the first subpopulation each comprise an identical target region and said nucleic acids of the second subpopulation each comprise an identical region that is a variant of the target region.

90. The mixture of beads of paragraph 89, wherein said nucleic acids are attached to each bead of said plurality of beads.

91. The mixture of beads of paragraph 89, wherein said plurality of beads is distributed on a substrate.

92. The mixture of beads of paragraph 89 distributed on a substrate.

93. The mixture of beads of paragraph 92, wherein said distribution on the substrate is a random distribution.

94. The mixture of beads of paragraph 92, wherein said substrate comprises wells and said beads are distributed in said wells.

95. The mixture of beads of paragraph 94, wherein wells of the substrate further comprise beads having an enzyme attached thereto.

96. The mixture of beads of paragraph 95, wherein said enzyme comprises sulfurylase.

97. The mixture of beads of paragraph 95, wherein said enzyme comprises luciferase.

98. The mixture of beads of paragraph 95, wherein wells of the substrate further comprise beads having neither a nucleic acid nor enzyme attached thereto.

99. A bead comprising a first subpopulation of capture nucleic acids having a competitor molecule hybridized thereto and a second subpopulation of capture nucleic acids comprising a region that permits hybridization of a complementary molecule.

100. A bead comprising capture nucleic acids hybridized with an amplified nucleic acid comprising a degenerate tag, said degenerate tag being hybridized to a capture nucleic acid, wherein said bead is present in a channel of a substrate or wherein said bead is present in a well of a multiwell substrate, said well being configured to hold a single bead having said amplified nucleic acid hybridized thereto.

DETAILED DESCRIPTION

Aspects of the present invention relate to methods, systems and compositions for detecting multiple types of molecules associated with a solid support. Some of the methods described herein relate to detecting molecules associated with a site on a solid support, where the site comprises at least two different types of molecules. In some embodiments, the molecules are associated with a site on a solid support such that the molecules are detected in aggregate during detection. Biomolecules, such as proteins and nucleic acids, can be detected by the methods described herein. Some embodiments of these methods can be employed to detect sequences of nucleic acids associated with a site on the solid support. In some embodiments, the nucleic acids are attached at a site on the solid support. Although the detection and evaluation of molecules in aggregate is exemplified herein for embodiments in which the molecules are associated with a solid support, it will be understood that the methods and compositions can also be used for embodiments wherein the aggregation of molecules is not bound to a solid support, for example, being in solution phase.

In some embodiments of the above-described methods, detecting the molecules in aggregate involves detecting a signal corresponding to the aggregate of molecules at a site on a solid support. The aggregate signal can then be deconvoluted using processes described herein, which can be implemented using one or more data processors, such as one or more computers. In some embodiments, one step in deconvoluting the aggregate signal involves estimating the fraction of different types of molecules at the site. Alternatively or additionally, the step can involve estimating the amount of signal corresponding to different types of molecules at the site. In some embodiments, the variation associated with one or both estimates is determined. In some embodiments, the amount of signal corresponding to different types of molecules at the site is calculated based on, or otherwise using, the variation associated with the fraction estimate, thereby obtaining a signal estimate. Alternatively or additionally, the fraction of different types of molecules at the site is calculated based on, or otherwise using, the variation associated with the signal estimate, thereby obtaining a fraction estimate. In further embodiments, the fraction and signal estimates are iteratively updated until the estimates converge. At or around convergence, the estimates represent a solution set that can be used to determine the types of different molecules associated with the site, thereby detecting molecules associated with the site.
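By way of illustration only, the alternating update of the fraction estimate and the signal estimate can be sketched as follows. The sketch is not the disclosed algorithm; it assumes a simplified model in which a site holds exactly two types of nucleic acids, each detection cycle yields a noise-free four-channel intensity vector, and the aggregate signal is the fraction-weighted sum of the two types' per-cycle signals. The function names (`update_signal`, `update_fraction`, `deconvolute`), the initial guess of 0.6 and the tolerance are all hypothetical. An initial guess other than 0.5 is used so that the first type is treated as the majority type.

```python
from itertools import product

BASES = "ACGT"


def one_hot(base):
    """Idealized four-channel signal for a single base (an assumption)."""
    return [1.0 if b == base else 0.0 for b in BASES]


def update_signal(observed, fraction):
    """With the fraction fixed, pick the base pair that best explains each cycle."""
    calls = []
    for y in observed:
        best = min(
            product(BASES, repeat=2),
            key=lambda pair: sum(
                (y[k] - (fraction * one_hot(pair[0])[k]
                         + (1.0 - fraction) * one_hot(pair[1])[k])) ** 2
                for k in range(4)
            ),
        )
        calls.append(best)
    return calls


def update_fraction(observed, calls, fallback):
    """With the base calls fixed, solve for the least-squares fraction."""
    num = den = 0.0
    for y, (a, b) in zip(observed, calls):
        ea, eb = one_hot(a), one_hot(b)
        for k in range(4):
            d = ea[k] - eb[k]
            num += (y[k] - eb[k]) * d
            den += d * d
    if den == 0.0:  # the two types were called identically at every cycle
        return fallback
    return min(1.0, max(0.0, num / den))


def deconvolute(observed, fraction=0.6, tol=1e-9, max_iter=100):
    """Alternate the two updates until the fraction estimate stops changing."""
    calls = []
    for _ in range(max_iter):
        calls = update_signal(observed, fraction)
        new_fraction = update_fraction(observed, calls, fraction)
        converged = abs(new_fraction - fraction) < tol
        fraction = new_fraction
        if converged:
            break
    return fraction, calls


# Hypothetical aggregate signal: 70% "ACGT" mixed with 30% "ACTT".
observed = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],
]
fraction, calls = deconvolute(observed)
first_type = "".join(a for a, _ in calls)
second_type = "".join(b for _, b in calls)
```

In this sketch the signal update is an exhaustive search over base pairs and the fraction update has a closed form, so each pass cannot increase the squared error. Real data would contain noise, phasing effects and possibly more than two types, none of which is modeled here.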

It will be appreciated that, in preferred embodiments, a first signal and/or first fraction estimate is obtained. A first signal estimate can be used to estimate the fraction, which can be either a first or subsequent fraction estimate depending on whether the fraction has been previously estimated. Similarly, a first fraction estimate can be used to estimate the signal, which can be either a first or subsequent signal estimate depending on whether the signal has been previously estimated. In preferred embodiments, the fraction and signal estimates are iteratively updated by applying one or more algorithms to determine the convergence of the estimates.

Some of the systems described herein relate to a molecule detection system comprising a solid support comprising molecules associated with a site on the support such that the molecules are detected in aggregate, where the molecules comprise at least two different types of molecules; and a detector configured to detect the molecules associated with the site.

Some embodiments of the above-described molecule detection systems can further include a first data processing module configured to estimate the fraction of different types of molecules at the site. Alternatively or additionally, the first data processing module, or another data processing module, can be configured to estimate the amount of signal corresponding to different types of molecules at the site. In some embodiments, the first data processing module is also configured to determine the variation associated with one or both of the estimates. In other embodiments, a separate data processing module is configured to determine the variation associated with one or both of the estimates. In some embodiments, the systems can further include a second data processing module configured to calculate the amount of signal corresponding to different types of molecules at the site. In some embodiments, the amount of signal corresponding to different types of molecules at the site is calculated based on, or otherwise using, the variation associated with the fraction estimate. Additionally or alternatively, a second data processing module can be configured to calculate the fraction of different types of molecules at the site. In some embodiments, the fraction of different types of molecules at the site is calculated based on, or otherwise using, the variation associated with the signal estimate. In preferred embodiments of such systems, the systems can further include a third data processing module configured to iteratively update the fraction estimate and signal estimate.

It will be appreciated that in some embodiments described herein, a plurality of data processing functions can be included together in one or a few data processing modules. In other embodiments, data processing functions can be included separately in separate data processing modules. It will also be appreciated that a first data processing module need not be separate from a second data processing module. In some embodiments, a first data processing module, a second data processing module, as well as third and subsequent data processing modules, are part of the same data processing module.

Other methods described herein relate to identifying a target region of a nucleic acid. In such embodiments, two subpopulations of nucleic acids are associated with a site on a solid support. Nucleic acids of the first subpopulation include a first target region that is the same for such nucleic acids. Nucleic acids of the second subpopulation include a target region that is the same for such nucleic acids but which is a variant of the first target region. In some embodiments, nucleic acid sequencing is performed on the nucleic acids to generate a signal corresponding to one or more nucleotides of the target region of first subpopulation nucleic acids and one or more nucleotides of the variant of the target region of second subpopulation nucleic acids. This signal can be deconvoluted as described above, and further herein, to obtain sequence data for the target region and the variant of the target region. Sequencing can be performed using methods known in the art, including but not limited to, a sequencing by hybridization process, a sequencing by ligation process, a sequencing by exonucleolysis (for example, an exonucleolytic-nanopore process), or a sequencing-by-synthesis process (for example, a pyrosequencing process).
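As a non-limiting illustration of how a target region and its variant can contribute to one aggregate signal, the following sketch computes an idealized pyrosequencing-style flowgram for each subpopulation and mixes the two by fraction. It assumes a noise-free signal proportional to the number of nucleotides incorporated per flow, a cyclic flow order of T, A, C, G, and sequences written as the strand being synthesized; the names `flowgram` and `aggregate_flowgram` and the example sequences are hypothetical.

```python
def flowgram(sequence, flow_order="TACG", n_flows=8):
    """Idealized per-flow signal: the homopolymer run incorporated at each flow."""
    position = 0
    signal = []
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        # Extend through every consecutive nucleotide matching the flowed base.
        while position < len(sequence) and sequence[position] == base:
            run += 1
            position += 1
        signal.append(run)
    return signal


def aggregate_flowgram(target, variant, fraction, flow_order="TACG", n_flows=8):
    """Fraction-weighted sum of the target and variant flowgrams at one site."""
    first = flowgram(target, flow_order, n_flows)
    second = flowgram(variant, flow_order, n_flows)
    return [fraction * a + (1.0 - fraction) * b for a, b in zip(first, second)]


# Hypothetical target region and a single-nucleotide variant of it.
mixed = aggregate_flowgram("TTAG", "TAAG", fraction=0.5, n_flows=4)
```

Non-integer values in the mixed flowgram mark the flows at which the two subpopulations differ, which is the information a deconvolution step would exploit.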

In some embodiments of the above-described methods for identifying a target region of a nucleic acid, signal and/or fraction estimates for the first and second subpopulations of nucleic acids can be determined. In some embodiments, the amount of signal corresponding to first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site is calculated based on, or otherwise using, the fraction estimate. Alternatively or additionally, the fraction of first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site is calculated based on, or otherwise using, the signal estimate. In preferred embodiments, the fraction estimate and signal estimates are iteratively updated until the estimates converge. In such a way, the target region of the nucleic acids of the subpopulations can be identified.

Additional methods described herein relate to identifying a biosignature. In some embodiments of such methods, samples from a plurality of subjects are investigated. Such samples comprise molecules, such as nucleic acids and proteins. In a preferred embodiment, at least some of the molecules from the samples are tagged to permit identification of the subject from which each sample originated. However, addition of an extrinsic tag is optional. In some embodiments, tags that are intrinsic to the molecules, such as distinguishable nucleotide sequences in the case of nucleic acid molecules, can be used. In some embodiments, the molecules are then associated with a site on a solid support such that the molecules are detected in aggregate during a detection step. As discussed in connection with other methods described herein, the site comprises at least two different types of molecules.
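For illustration, assigning sequence data to the subject of origin by means of a tag can be sketched as below. The sketch assumes that each read begins with a fixed-length extrinsic tag and that a lookup table from tag to subject is available; the function name `assign_subjects`, the tag sequences and the subject labels are hypothetical.

```python
from collections import defaultdict


def assign_subjects(reads, tag_to_subject, tag_length):
    """Group reads by the subject identified by each read's leading tag."""
    by_subject = defaultdict(list)
    for read in reads:
        tag, remainder = read[:tag_length], read[tag_length:]
        subject = tag_to_subject.get(tag)
        if subject is not None:  # reads with an unrecognized tag are skipped
            by_subject[subject].append(remainder)
    return dict(by_subject)


# Hypothetical two-nucleotide tags for two subjects.
tags = {"AA": "subject_1", "CC": "subject_2"}
grouped = assign_subjects(["AAGTTC", "CCGTAC", "GGTTTT"], tags, tag_length=2)
```

Where an intrinsic tag is used instead, the lookup key would be a distinguishing sequence within the molecule rather than an added prefix.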

In some embodiments, a biosignature is obtained for a population or subpopulation of molecules associated with the site. In certain embodiments, obtaining a biosignature is performed by detecting a signal corresponding to the aggregate of the molecules at the site and then deconvoluting the aggregate signal as described in connection with the methods set out above and elsewhere herein. In a preferred embodiment, the biosignature that is obtained is compared to a reference biosignature, thereby permitting identification of the biosignature.
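One way to picture the comparison to a reference biosignature is sketched below. This is an assumption-laden illustration rather than a disclosed procedure: it represents a biosignature as a mapping from molecule type to estimated fraction, and it uses half the sum of absolute differences as the distance measure. The names `signature_distance` and `closest_reference`, the type labels and the reference values are hypothetical.

```python
def signature_distance(signature, reference):
    """Half the summed absolute difference between two fraction profiles."""
    types = set(signature) | set(reference)
    return 0.5 * sum(
        abs(signature.get(t, 0.0) - reference.get(t, 0.0)) for t in types
    )


def closest_reference(signature, references):
    """Return the name of the reference profile nearest to the signature."""
    return min(
        references,
        key=lambda name: signature_distance(signature, references[name]),
    )


# Hypothetical deconvoluted fractions and two hypothetical references.
obtained = {"wild_type": 0.7, "variant": 0.3}
references = {
    "reference_a": {"wild_type": 0.75, "variant": 0.25},
    "reference_b": {"wild_type": 0.2, "variant": 0.8},
}
match = closest_reference(obtained, references)
```

Other representations and distance measures could serve equally well; the choice here only keeps the example short.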

Some of the compositions described herein relate to a solid support comprising a population of nucleic acids associated with a site on the solid support such that nucleic acids of the population of nucleic acids are detected in aggregate. In some embodiments, the population of nucleic acids comprises a first subpopulation and a second subpopulation, wherein nucleic acids of the first subpopulation comprise an identical target region and nucleic acids of the second subpopulation comprise an identical region that is a variant of the target region.

Other compositions described herein relate to mixtures of beads comprising a plurality of beads. In some embodiments, each bead of the plurality of beads comprises a first subpopulation and a second subpopulation of nucleic acids, wherein the first subpopulation and the second subpopulation of nucleic acids are associated with the bead such that they are detected in aggregate. In some embodiments, the nucleic acids of the first subpopulation each comprise an identical target region and the nucleic acids of the second subpopulation each comprise an identical region that is a variant of the target region. The variant can be one or more nucleotides of a sequence region. In particular embodiments, the variant is a single nucleotide such as the type present in a single nucleotide polymorphism (SNP).

Other compositions described herein relate to beads comprising a first subpopulation of capture nucleic acids having a competitor molecule hybridized thereto and a second subpopulation of capture nucleic acids comprising a region that permits hybridization of a complementary molecule.

Still other compositions described herein relate to beads comprising capture nucleic acids hybridized with an amplified nucleic acid comprising a degenerate tag. In some embodiments, a degenerate tag is hybridized to a capture nucleic acid, wherein the bead is present in a channel of a substrate. In some embodiments, the bead is present in a well of a multiwell substrate. In some embodiments, the well is configured to hold a single bead having the amplified nucleic acid hybridized thereto.

DEFINITIONS

As used herein, “oligonucleotide,” “polynucleotide,” “nucleic acid” and/or grammatical equivalents thereof can refer to at least two nucleotide monomers linked together. A nucleic acid can generally contain phosphodiester bonds; however, in some embodiments, nucleic acid analogs may have other types of backbones, comprising, for example, phosphoramide (Beaucage, et al., Tetrahedron, 49:1925 (1993); Letsinger, J. Org. Chem., 35:3800 (1970); Sprinzl, et al., Eur. J. Biochem., 81:579 (1977); Letsinger, et al., Nucl. Acids Res., 14:3487 (1986); Sawai, et al., Chem. Lett., 805 (1984); Letsinger, et al., J. Am. Chem. Soc., 110:4470 (1988); and Pauwels, et al., Chemica Scripta, 26:141 (1986), incorporated by reference in their entireties), phosphorothioate (Mag, et al., Nucleic Acids Res., 19:1437 (1991); and U.S. Pat. No. 5,644,048), phosphorodithioate (Briu, et al., J. Am. Chem. Soc., 111:2321 (1989), incorporated by reference in its entirety), O-methylphosphoroamidite linkages (see Eckstein, Oligonucleotides and Analogues: A Practical Approach, Oxford University Press, incorporated by reference in its entirety), and peptide nucleic acid backbones and linkages (see Egholm, J. Am. Chem. Soc., 114:1895 (1992); Meier, et al., Chem. Int. Ed. Engl., 31:1008 (1992); Nielsen, Nature, 365:566 (1993); Carlsson, et al., Nature, 380:207 (1996), incorporated by reference in their entireties). In some embodiments, oligonucleotide, polynucleotide and/or nucleic acid refer to at least five nucleotides linked together, at least 10 nucleotides linked together, at least 15 nucleotides linked together, at least 20 nucleotides linked together or at least 25 nucleotides linked together.

Other analog nucleic acids include those with positive backbones (Denpcy, et al., Proc. Natl. Acad. Sci. USA, 92:6097 (1995), incorporated by reference in its entirety); non-ionic backbones (U.S. Pat. Nos. 5,386,023; 5,637,684; 5,602,240; 5,216,141; and 4,469,863; Kiedrowshi, et al., Angew. Chem. Intl. Ed. English, 30:423 (1991); Letsinger, et al., J. Am. Chem. Soc., 110:4470 (1988); Letsinger, et al., Nucleosides & Nucleotides, 13:1597 (1994); Chapters 2 and 3, ASC Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghui and P. Dan Cook; Mesmaeker, et al., Bioorganic & Medicinal Chem. Lett., 4:395 (1994); Jeffs, et al., J. Biomolecular NMR, 34:17 (1994); Tetrahedron Lett., 37:743 (1996), incorporated by reference in their entireties) and non-ribose backbones (U.S. Pat. Nos. 5,235,033 and 5,034,506, and Chapters 6 and 7, ASC Symposium Series 580, “Carbohydrate Modifications in Antisense Research”, Ed. Y. S. Sanghui and P. Dan Cook, incorporated by reference in their entireties). Nucleic acids may also contain one or more carbocyclic sugars (see Jenkins, et al., Chem. Soc. Rev., (1995) pp. 169-176).

Modifications of the ribose-phosphate backbone may be done to facilitate the addition of additional moieties such as labels, or to increase the stability of such molecules under certain conditions. In addition, mixtures of naturally occurring nucleic acids and analogs can be made. Alternatively, mixtures of different nucleic acid analogs, and mixtures of naturally occurring nucleic acids and analogs may be made. The nucleic acids may be single stranded or double stranded, as specified, or contain portions of both double stranded and single stranded sequence. The nucleic acid may be DNA, for example, genomic or cDNA, RNA or a hybrid. A nucleic acid can contain any combination of deoxyribo- and ribo-nucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine, hypoxanthine, isocytosine, isoguanine, and base analogs such as nitropyrrole (including 3-nitropyrrole) and nitroindole (including 5-nitroindole), etc.

In some embodiments, a nucleic acid can include at least one promiscuous base. Promiscuous bases can base-pair with more than one different type of base. In some embodiments, a promiscuous base can base-pair with at least two different types of bases and no more than three different types of bases. An example of a promiscuous base is inosine, which may pair with adenine, thymine, or cytosine. Other examples include hypoxanthine, 5-nitroindole, acyclic 5-nitroindole, 4-nitropyrazole, 4-nitroimidazole and 3-nitropyrrole (Loakes et al., Nucleic Acid Res. 22:4039 (1994); Van Aerschot et al., Nucleic Acid Res. 23:4363 (1995); Nichols et al., Nature 369:492 (1994); Berstrom et al., Nucleic Acid Res. 25:1935 (1997); Loakes et al., Nucleic Acid Res. 23:2361 (1995); Loakes et al., J. Mol. Biol. 270:426 (1997); and Fotin et al., Nucleic Acid Res. 26:1515 (1998), incorporated by reference in their entireties). Promiscuous bases that can base-pair with at least three, four or more types of bases can also be used.

As used herein, “nucleotide” and/or grammatical equivalents thereof can refer to a nucleotide and/or nucleotide analog. In some embodiments, nucleotides can become incorporated into a polynucleotide. In some embodiments, nucleotides may be substrates for an enzyme that can extend a polynucleotide strand. Nucleotides may or may not become incorporated into a nascent polynucleotide in template-based polynucleotide synthesis. Nucleotides may or may not contain labels and/or terminators. In some embodiments, terminators include reversibly terminating moieties. Incorporation of a nucleotide comprising a reversible terminator can inhibit extension of the polynucleotide; however, the moiety can be removed and the polynucleotide may then be extended further. Such reversible terminators are known in the art, examples of which are described in U.S. Pat. No. 7,541,444; U.S. Pat. No. 7,057,026; U.S. Pat. No. 7,414,116; U.S. Pat. No. 7,427,673; U.S. Pat. No. 7,566,537; U.S. Pat. No. 7,592,435 and WO 07/135368, each of which is incorporated herein by reference in its entirety. In some embodiments, a nucleotide may comprise both a label and a terminator. Examples of nucleotides include deoxyribonucleotides, modified deoxyribonucleotides, ribonucleotides, modified ribonucleotides, peptide nucleotides, modified peptide nucleotides, modified phosphate sugar backbone nucleotides and mixtures thereof. Nucleotide analogs which include a modified nucleobase can also be used in the methods described herein. As is known in the art, certain nucleotide analogues cannot become incorporated into a polynucleotide, for example, nucleotide analogues such as adenosine 5′-phosphosulfate. In some embodiments, the bases are promiscuous bases.

As used herein, “a portion” and/or grammatical equivalents thereof can refer to any fraction of a whole amount. In some embodiments, the term portion may be applied to any substance or process that has boundaries, a particular number of items or a beginning and an end. For example, in some embodiments, the term portion can be applied to nucleic acids, target regions of nucleic acids, variants of target regions of nucleic acids, samples, populations, subpopulations, solid supports, sites, beads and process steps. In some embodiments, “at least a portion” can refer to at least about 1%, at least about 2%, at least about 3%, at least about 4%, at least about 5%, at least about 6%, at least about 7%, at least about 8%, at least about 9%, at least about 10%, at least about 15%, at least about 20%, at least about 25%, at least about 30%, at least about 35%, at least about 40%, at least about 45%, at least about 50%, at least about 55%, at least about 60%, at least about 65%, at least about 70%, at least about 75%, at least about 80%, at least about 85%, at least about 90%, at least about 95%, or at least about 99% of the whole amount. In some embodiments, a portion can refer to any fraction of a whole amount that is from 0.1% to 99.9%. In some embodiments, a portion can refer to any fraction of a whole amount that is from 1% to 99%. In some embodiments, a portion can refer to any fraction of a whole amount that is from 2% to 98%. In some embodiments, a portion can refer to any fraction of a whole amount that is from 5% to 95%. In some embodiments, a portion can refer to any fraction of a whole amount that is from 10% to 90%. In some embodiments, a portion can refer to any fraction of a whole amount that is from 20% to 90%. In some embodiments, a portion can refer to any fraction of a whole amount that is from 30% to 90%.

As used herein, “fraction” and/or grammatical equivalents thereof means any part of a whole composition or process, the absence of a whole composition or process or the presence of the whole composition or process. For example, in some embodiments, the term fraction can be applied to populations and/or subpopulations of molecules. In some such embodiments, where molecules are associated with a site on a solid support, fraction can be used to refer to a part of the population or subpopulation of molecules present at the site. For example, if three different types of molecules are thought to be present at a site, then the population of molecules present at the site can be divided into three fractions, each of which represents a particular type of molecule in the population or subpopulation.

As used herein, “detected in aggregate” and/or grammatical equivalents thereof can refer to the manner in which the molecules at a site on a solid support are detected. In some embodiments, “detected in aggregate” means that molecules present at a site, which are separately associated with or separately attached at the site, are detected together. In some embodiments, the molecules need not be separately associated with or separately attached at the site. In some embodiments, the entire population of molecules at the site is detected. In other embodiments, only a portion of the entire population of molecules at the site is actually detected.

In some embodiments, detection may be indirect, as in pyrosequencing. For example, a detectable signal can be produced by a molecule that is in proximity of the molecule to be detected but which is not attached to the detected molecule. In other embodiments, detection can be direct, as in the case of a label attached to a molecule associated with a site. For example, in some embodiments, a labeled molecule can be incorporated into a nucleic acid in sequencing by synthesis applications. In some embodiments, direct detection can occur even if the label is attached to the molecule through one or more intermediary molecules.

As used herein, the term “complementary” and/or grammatical equivalents thereof refer to the nucleotide base-pairing interaction of one nucleic acid with another nucleic acid that results in the formation of a duplex, triplex, or other higher-ordered structure. In some embodiments, the nucleic acids are similar enough in complementarity between sequences to permit hybridization under various stringency conditions. As will be appreciated by persons skilled in the art, stringent conditions are sequence-dependent and are different in different circumstances. For example, longer fragments may require higher hybridization temperatures for specific hybridization than short fragments. Because other factors, such as base composition and length of the complementary strands, presence of organic solvents, and the extent of base mismatching, may affect the stringency of hybridization, the combination of parameters can be more important than the absolute measure of any one parameter alone. In some embodiments, hybridization can be made to occur under high stringency conditions, such as high temperatures or 0.1×SSC. Examples of high stringency conditions are known in the art; see for example Sambrook et al., Molecular Cloning: A Laboratory Manual, 2d Edition, 1989, and Short Protocols in Molecular Biology, ed. Ausubel et al., both of which are hereby incorporated by reference. In general, increasing the temperature at which the hybridization is performed increases the stringency. As such, the hybridization reactions described herein can be performed at a different temperature depending on the desired stringency of hybridization. Hybridization temperatures can be as low as, or even lower than, 5° C., but are typically greater than 22° C., and more typically greater than about 30° C., and even more typically in excess of 37° C. In other embodiments, the stringency of the hybridization can further be altered by the addition or removal of components of the buffered solution. In some embodiments, hybridization is permitted under medium stringency conditions. In other embodiments, hybridization is permitted under low stringency conditions. In some embodiments, a nucleic acid sequence is perfectly complementary to a capture nucleic acid or other molecule with which it binds. In other embodiments, one or more mismatches are present between the hybridized molecules or hybridized portions of molecules.

As used herein, the term “type of molecule” and/or grammatical equivalents thereof are intended to refer to any grouping of molecules that can be based on either natural or artificial criteria. For example, in some embodiments, two nucleic acids having an identical sequence but for a single nucleotide difference can be considered different types of molecules. In other embodiments, polypeptides having identical amino acid sequence but for a single amino acid difference can be considered different types of molecules. In some embodiments, difference may be based on differences in sequence length or other higher level structural arrangements including, but not limited to, secondary, tertiary and quaternary structures. In some embodiments, type of molecule can refer to the class of molecule. For example, nucleic acids and proteins are different classes of molecules, and thus, can be considered different types of molecules.

In some embodiments, a portion of a molecule can be used to characterize molecules as different types of molecules. For example, a portion of the molecule used to characterize a molecule can be a portion of the primary sequence of a polymeric molecule, such as a nucleotide or amino acid sequence, or a portion of another structural feature, such as nucleic acid or polypeptide secondary structure. In some embodiments, the portion of the molecule is a stretch or region of contiguous or noncontiguous nucleotide sequence. In some embodiments, “a target region” of a molecule, such as a nucleic acid, is a portion of the molecule, which is less than the entirety of the nucleotide sequence. In such embodiments, the target region of a nucleic acid can be characterized with respect to a region of another nucleic acid as either identical or a variant. In such embodiments, if the nucleotide sequence of the target region of a first nucleic acid is identical to a region of a second nucleic acid, then the molecules can be characterized as the same type of molecule. If the nucleotide sequence of the target region of a first nucleic acid has one or more nucleotide differences with respect to a region of a second nucleic acid, then the molecules can be characterized as different types of molecules. In such embodiments, the region of the second nucleic acid is referred to as a “variant of the target region.”

As used herein, the term “estimating,” “approximating” and/or grammatical equivalents thereof include, but are not limited to, estimating based on no knowledge about the particular system being analyzed or similar systems, for example, guessing or random number generation. In some embodiments, estimating can be based on some knowledge about the particular system being analyzed or similar systems. For example, principal component analysis of the system under investigation or a similar system can provide information that can be used to generate numerical estimates that may be closer to the actual numeric values of one or more parameters being investigated in connection with a particular system.

As used herein, the term “data processing module” and/or grammatical equivalents thereof refer to a module that processes data. A data processing module can be implemented in hardware, software or a combination of both. In some embodiments, multiple data processing modules can be combined. In some embodiments, the functionality of multiple data processing modules is implemented in a single data processing module. In other embodiments, data processing modules are separate. In some embodiments, data processing instructions included in one or more data processing modules can be executed by a single CPU. In other embodiments, such instructions may be implemented by multiple CPUs.

As used herein, the term “site” and/or grammatical equivalents thereof refer to a location on a solid support. In some embodiments, a site refers to a location on a solid support where molecules are associated together in close proximity. In some embodiments, the molecules are present at the site in such a fashion that signal produced in the detection of the molecules is detected as an aggregate signal. For example, the resolution of the detection system may be at a level that can only detect an aggregate signal from a site and cannot distinguish individual signals from different molecules at the site. In some embodiments, the signals corresponding to molecules of different types that are associated with a site cannot be adequately resolved from each other. In some embodiments, a site is a specific area on a large solid support, such as a site on a solid substrate, such as a flow cell. In other embodiments, a site includes the entire solid support, as in the case of certain beads, such as microbeads. As such, in some embodiments “a site on a solid support” can refer to a portion of a solid support, whereas in other embodiments, a site on a solid support can refer to the entire solid support.

In some embodiments of the methods, systems and compositions described herein, the term “site” means a feature of a microarray. In some embodiments, a site has a certain number of molecules associated therewith. In other embodiments, a site has a number of molecules within a specified range associated therewith. For example, in some embodiments, a site includes about 2 to about 10¹¹ molecules, about 2 to about 10¹⁰ molecules, about 2 to about 10⁹ molecules, about 2 to about 10⁸ molecules, about 2 to about 10⁷ molecules, about 2 to about 10⁶ molecules, about 2 to about 10⁵ molecules or about 2 to about 10⁴ molecules. In other embodiments, a site includes about 10 to about 10¹¹ molecules, about 10 to about 10¹⁰ molecules, about 10 to about 10⁹ molecules, about 10 to about 10⁸ molecules, about 10 to about 10⁷ molecules, about 10 to about 10⁶ molecules, about 10 to about 10⁵ molecules or about 10 to about 10⁴ molecules. In still other embodiments, the site includes about 50 to about 10¹¹ molecules, about 50 to about 10¹⁰ molecules, about 50 to about 10⁹ molecules, about 50 to about 10⁸ molecules, about 50 to about 10⁷ molecules, about 50 to about 10⁶ molecules, about 50 to about 10⁵ molecules or about 50 to about 10⁴ molecules. In yet other embodiments, a site includes about 100 to about 10¹¹ molecules, about 100 to about 10¹⁰ molecules, about 100 to about 10⁹ molecules, about 100 to about 10⁸ molecules, about 100 to about 10⁷ molecules, about 100 to about 10⁶ molecules, about 100 to about 10⁵ molecules or about 100 to about 10⁴ molecules associated therewith.

In a preferred embodiment, the initial number of molecules associated with a site ranges from about 10 to about 1000 molecules. In another preferred embodiment, about 10 to about 500 molecules are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 molecules are initially associated with the site.

In some embodiments, a site includes less area than the area of the entire solid support surface, such as a microarray surface, where molecules are associated. In certain embodiments, a site includes less than 90%, less than 80%, less than 70%, less than 60%, less than 50%, less than 40%, less than 30%, less than 20%, less than 10%, less than 5%, less than 4%, less than 3%, less than 2%, less than 1%, less than 0.5%, less than 0.1%, less than 0.01%, less than 0.001%, less than 0.0001%, less than 0.00001%, less than 10⁻⁶%, less than 10⁻⁷%, less than 10⁻⁸%, less than 10⁻⁹%, less than 10⁻¹⁰% or less than 10⁻¹¹% of the totality of the solid support surface, such as a microarray surface, where molecules are associated. In certain embodiments, a site includes less than 90%, less than 80%, less than 70%, less than 60%, less than 50%, less than 40%, less than 30%, less than 20%, less than 10%, less than 5%, less than 4%, less than 3%, less than 2%, less than 1%, less than 0.5%, less than 0.1%, less than 0.01%, less than 0.001%, less than 0.0001%, less than 0.00001%, less than 10⁻⁶%, less than 10⁻⁷%, less than 10⁻⁸%, less than 10⁻⁹%, less than 10⁻¹⁰% or less than 10⁻¹¹% of the totality of a demarcated area on an array substrate, whether physically or virtually demarcated, where molecules are associated.

In some embodiments, a site is a feature having its longest dimension in the micron range, which can accommodate a plurality of nucleic acids detectable and/or resolvable by current optical imaging devices such as scanners. In some embodiments, the site is a feature of about 0.1 square micron, 0.2 square microns, 0.3 square microns, 0.4 square microns, 0.5 square microns, 0.6 square microns, 0.7 square microns, 0.8 square microns, 0.9 square microns, 1.0 square microns, 1.1 square microns, 1.2 square microns, 1.3 square microns, 1.4 square microns, 1.5 square microns, 1.6 square microns, 1.7 square microns, 1.8 square microns, 1.9 square microns, 2 square microns, 3 square microns, 4 square microns, 5 square microns, 6 square microns, 7 square microns, 8 square microns, 9 square microns, 10 square microns, 11 square microns, 12 square microns, 13 square microns, 14 square microns, 15 square microns, 20 square microns, 25 square microns, 30 square microns, 35 square microns, 40 square microns, 45 square microns, to about 50 square microns, or any size in between any of the foregoing values.

In other embodiments, a site is a feature having its longest dimension in the sub-micron range that accommodates a plurality of nucleic acids detectable and/or resolvable by current imaging devices such as scanners. In some embodiments, the site is a feature size of about 1 square nanometer, 5 square nanometers, 10 square nanometers, 15 square nanometers, 20 square nanometers, 25 square nanometers, 30 square nanometers, 35 square nanometers, 40 square nanometers, 45 square nanometers, 50 square nanometers, 55 square nanometers, 60 square nanometers, 65 square nanometers, 70 square nanometers, 75 square nanometers, 80 square nanometers, 85 square nanometers, 90 square nanometers, 95 square nanometers, 100 square nanometers, 125 square nanometers, 150 square nanometers, 175 square nanometers, 200 square nanometers, 225 square nanometers, 250 square nanometers, 275 square nanometers, 300 square nanometers, 325 square nanometers, 350 square nanometers, 375 square nanometers, 400 square nanometers, 425 square nanometers, 450 square nanometers, 475 square nanometers, 500 square nanometers, 525 square nanometers, 550 square nanometers, 575 square nanometers, 600 square nanometers, 625 square nanometers, 650 square nanometers, 675 square nanometers, 700 square nanometers, 725 square nanometers, 750 square nanometers, 775 square nanometers, 800 square nanometers, 825 square nanometers, 850 square nanometers, 875 square nanometers, 900 square nanometers, 925 square nanometers, 950 square nanometers, 975 square nanometers to about 1000 square nanometers, or any size in between any of the foregoing values.

It will be understood that some embodiments contemplate a site that is a feature in the picometer range. Accordingly, in some embodiments, the site is a feature of about 1 square picometer, 5 square picometers, 10 square picometers, 15 square picometers, 20 square picometers, 25 square picometers, 30 square picometers, 35 square picometers, 40 square picometers, 45 square picometers, 50 square picometers, 55 square picometers, 60 square picometers, 65 square picometers, 70 square picometers, 75 square picometers, 80 square picometers, 85 square picometers, 90 square picometers, 95 square picometers, 100 square picometers, 125 square picometers, 150 square picometers, 175 square picometers, 200 square picometers, 225 square picometers, 250 square picometers, 275 square picometers, 300 square picometers, 325 square picometers, 350 square picometers, 375 square picometers, 400 square picometers, 425 square picometers, 450 square picometers, 475 square picometers, 500 square picometers, 525 square picometers, 550 square picometers, 575 square picometers, 600 square picometers, 625 square picometers, 650 square picometers, 675 square picometers, 700 square picometers, 725 square picometers, 750 square picometers, 775 square picometers, 800 square picometers, 825 square picometers, 850 square picometers, 875 square picometers, 900 square picometers, 925 square picometers, 950 square picometers, 975 square picometers to about 1000 square picometers, or any size in between any of the foregoing values.

As used herein, “biosignature” and/or grammatical equivalents thereof comprises information that indicates the presence, absence and/or identity of a molecule or a plurality of molecules in a population of molecules or a subpopulation of molecules. In some embodiments, a biosignature comprises information that indicates whether a single type of molecule is present or absent in a population or subpopulation of molecules. In other embodiments, a biosignature comprises information that indicates whether different types of molecules are present or absent in a population or subpopulation of molecules. In other embodiments, a biosignature comprises information that indicates the identity of a single type of molecule or a portion of a single type of molecule that is present or absent in a population or subpopulation of molecules. In other embodiments, a biosignature comprises information that indicates the identities of different types of molecules or portions of different types of molecules that are present or absent in a population or subpopulation of molecules. In still other embodiments, a biosignature comprises any combination of the above-described information. In a preferred embodiment, a biosignature comprises information that indicates the presence, absence and/or identity of a plurality of different types of molecules in a population of molecules, or a subpopulation of molecules, associated with a site on a solid support.

As used herein, “reference biosignature” means a biosignature that is known to be a characteristic of a group of organisms or group of other entities comprising biomolecules. In some embodiments, a reference biosignature is a biosignature that is known to be a characteristic of a species or variety of organism.

By “capture probe,” “capture nucleic acid” and/or grammatical equivalents thereof is meant a polynucleotide that is associated with a solid support and that is used to hybridize with a nucleic acid having at least a portion complementary to the capture probe.

As used herein, “solid support” and/or grammatical equivalents thereof means any solid or semi-solid substrate with which molecules can be associated. In some embodiments, a solid support is a solid substrate. In some embodiments, a solid support includes a plurality of sites. In some embodiments, a solid support can comprise a bead or other microparticle. In other embodiments, a solid support comprises a flow chamber or flow cell. In some embodiments, the solid support comprises one or more arrays or microarrays.

Nucleic Acid Amplification

Embodiments of the methods, systems and compositions described herein can be implemented with or without amplification of the molecules associated with a site on a solid support. In preferred embodiments, the molecules are amplified using standard techniques known in the art. In some embodiments, the molecules are amplified prior to associating the molecules with the site on the solid support. In other embodiments, the molecules are amplified subsequent to associating the molecules with the site on the solid support. In still other embodiments, the molecules are amplified both prior to and subsequent to associating the molecules with the site on the solid support.

Nucleic acid amplification includes the process of amplifying or increasing the numbers of a nucleic acid molecule template and/or of a complement thereof that are present, by producing one or more copies of the template and/or its complement. In the provided methods, amplification can be carried out by a variety of known methods under conditions including, but not limited to, thermocycling amplification or isothermal amplification. For example, methods for carrying out amplification are described in U.S. Publication No. 2009/0226975; WO 98/44151; WO 00/18957; WO 02/46456; WO 06/064199; and WO 07/010251; which are incorporated by reference herein in their entireties. Briefly, in the provided methods, amplification can occur on a surface to which one or more template nucleic acid molecules are attached. This type of amplification can be referred to as solid phase amplification, which when used in reference to nucleic acids, refers to any nucleic acid amplification reaction carried out on or in association with a surface (e.g., a solid support). Typically, all or a portion of the amplified products are synthesized by extension of an immobilized primer. Solid phase amplification reactions are analogous to standard solution phase amplifications except that at least one of the amplification primers is immobilized on a surface (e.g., a solid support).

Solid-phase amplification may involve a nucleic acid amplification reaction including only one species of oligonucleotide primer immobilized to a surface. Alternatively, the surface may have a plurality of first and second different immobilized oligonucleotide primer species. Solid phase nucleic acid amplification reactions generally include at least one of two different types of nucleic acid amplification, interfacial and surface (or bridge) amplification. For instance, in interfacial amplification the solid support includes template nucleic acid molecules that are each indirectly immobilized to the solid support by hybridization to an immobilized oligonucleotide primer. The immobilized primer may be extended in the course of a polymerase-catalyzed, template-directed elongation reaction (e.g., primer extension) to generate an immobilized nucleic acid molecule that remains attached to the solid support. After the extension phase, the nucleic acids (e.g., template and its complementary product) are denatured such that the template nucleic acid molecules are released into solution and each made available for hybridization to another immobilized oligonucleotide primer. The template nucleic acid molecules may be made available in 1, 2, 3, 4, 5 or more rounds of primer extension or may be washed out of the reaction after 1, 2, 3, 4, 5 or more rounds of primer extension.

In surface (or bridge) amplification, immobilized nucleic acid molecules each hybridize to an immobilized oligonucleotide primer. The immobilized nucleic acid molecule provides the template for a polymerase-catalyzed, template-directed elongation reaction (e.g., primer extension) extending from the immobilized oligonucleotide primer. The resulting double-stranded products “bridge” the two primers and both strands are covalently attached to the support. In the next cycle, following denaturation that yields a pair of single strands (the immobilized template and the extended-primer product) immobilized to the solid support, both immobilized strands can serve as templates for new primer extension.

Amplification methods can be used to produce clusters of immobilized nucleic acid molecules. For example, the methods can produce arrays of nucleic acid clusters, analogous to those described in U.S. Pat. No. 7,115,400; U.S. Publication No. 2005/0100900; WO 00/18957; and WO 98/44151, which are incorporated by reference herein in their entireties. A cluster is a plurality of nucleic acid molecules attached to a site on a surface. A cluster can include a plurality of copies of a single nucleic acid sequence or a plurality of copies of a plurality of nucleic acid sequences. The nucleic acid molecules making up the clusters may be in a single or double stranded form.

The clusters can have different shapes, sizes and densities depending on the conditions used. For example, clusters can have a shape that is substantially round, multi-sided, donut-shaped or ring-shaped. The diameter or maximum cross section of a cluster can be the same as or similar to those set forth above for sites in general. The density of clusters or other sites can be in the range of at least about 0.1/mm², at least about 1/mm², at least about 10/mm², at least about 100/mm², at least about 1,000/mm², at least about 10,000/mm² to at least about 100,000/mm². Optionally, the clusters have a density of, for example, 100,000/mm² to 1,000,000/mm² or 1,000,000/mm² to 10,000,000/mm².

Detection of Multiple Molecules in Aggregate

Disclosed herein are methods that can be used for detecting in aggregate multiple molecules that are associated with a site on a solid support, where the site comprises at least two different types of molecules. In some embodiments, detecting multiple molecules in aggregate can include the steps of detecting a signal corresponding to the aggregate of molecules at the site; estimating the fraction of different types of molecules at the site, or estimating the amount of signal corresponding to different types of molecules at the site; calculating the amount of signal corresponding to different types of molecules at the site using the fraction estimate, thereby obtaining a signal estimate, or calculating the fraction of different types of molecules at the site using the signal estimate, thereby obtaining a fraction estimate; and iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby detecting molecules attached at the site.
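The alternating updates described above can be illustrated with a minimal sketch. It is not the disclosed implementation; it assumes a hypothetical, simplified setting in which a site holds exactly two nucleic acid types, each sequencing cycle yields a four-channel intensity vector, each base produces an idealized unit signal in its own channel, and the signal is noise-free. The names `simulate_site`, `deconvolve_site`, `f0`, `tol` and `max_iter` are illustrative assumptions. Given the current fraction estimate, the sketch picks for each cycle the pair of bases that best explains the aggregate signal (the signal estimate); given those base calls, it re-estimates the fraction by least squares, and it repeats until the fraction stops changing.

```python
import itertools
import numpy as np

BASES = "ACGT"
E = np.eye(4)  # assumption: base i gives an idealized unit signal in channel i


def simulate_site(major, minor, fraction):
    """Noise-free aggregate signal (one row per cycle) for two mixed templates."""
    return np.array([
        fraction * E[BASES.index(a)] + (1.0 - fraction) * E[BASES.index(b)]
        for a, b in zip(major, minor)
    ])


def residual(y, pair, f):
    """Squared error between observed signal y and the signal predicted for a base pair."""
    i, j = pair
    return float(np.sum((y - f * E[i] - (1.0 - f) * E[j]) ** 2))


def deconvolve_site(Y, f0=0.6, tol=1e-9, max_iter=50):
    """Alternate between a signal estimate (base calls) and a fraction estimate."""
    pairs = list(itertools.product(range(4), repeat=2))
    f = f0
    calls = []
    for _ in range(max_iter):
        # Signal estimate: best-fitting base pair per cycle, given the fraction.
        calls = [min(pairs, key=lambda p: residual(y, p, f)) for y in Y]
        # Fraction estimate: least-squares fit of f, given the base calls.
        num = den = 0.0
        for y, (i, j) in zip(Y, calls):
            d = E[i] - E[j]
            num += float(np.dot(y - E[j], d))
            den += float(np.dot(d, d))
        # Keep the old fraction if no cycle distinguishes the two types.
        f_new = float(np.clip(num / den, 0.0, 1.0)) if den > 0 else f
        converged = abs(f_new - f) < tol
        f = f_new
        if converged:
            break
    major = "".join(BASES[i] for i, _ in calls)
    minor = "".join(BASES[j] for _, j in calls)
    return f, major, minor


Y = simulate_site("ACGTAC", "ACTTAG", 0.7)
fraction, major, minor = deconvolve_site(Y)
print(fraction, major, minor)
```

The starting fraction is deliberately set away from 0.5 in this sketch, because at exactly 0.5 the two types are interchangeable and the base-pair assignment is ambiguous. Real signals would also need a noise model and per-channel calibration, which are omitted here.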

In some embodiments, a mixture of molecules is provided to the solid support or to a site on the solid support. In some embodiments, the mixture of molecules can include one type of molecule, for example, molecules having an identical target region. In other embodiments, the mixture of molecules can include a plurality of molecule types, for example, some molecules each having an identical target region and other molecules having one or more variants of a target region.

In some embodiments, a mixture of molecules is associated with a site on the solid support. Molecules can be associated with a solid support either directly or indirectly. In some embodiments, molecules are coupled to a solid support without attaching them to the solid support, such as by dissolving or suspending the molecules in a droplet on the surface of a solid support. In other embodiments, molecules can be associated with a solid support by coupling the molecules to a first solid support, such as a bead, which itself is associated with a second solid support, such as an array surface, a well on an array or a flow chamber.

In some embodiments, the molecules are attached at a site on a solid support. The attachment can be direct bonding or adhesion to a surface of the solid support or indirect attachment. Examples of indirect attachment include, but are not limited to, attachment of the molecules to the solid support via a linker molecule or a capture nucleic acid. In some embodiments, indirect attachment can include attaching molecules to a first solid support, such as a bead, which itself is attached to a second solid support, such as an array surface, a well on an array or a flow chamber.

In some embodiments, the multiple types of molecules at the site that are detected in aggregate have a linear relationship, that is, the presence of one type of molecule at the site does not influence other type(s) of molecule at the site. For example, if there are two different types of nucleic acids at the site, allele 1 and allele 2, the value of signal(fraction 1*allele 1 + fraction 2*allele 2) is expected to be substantially similar to the value of fraction 1*signal(allele 1) + fraction 2*signal(allele 2). As another example, if the three different types of nucleic acids at the site are allele 1, allele 2 and allele 3, the value of signal(fraction 1*allele 1 + fraction 2*allele 2 + fraction 3*allele 3) is expected to be substantially similar to the value of fraction 1*signal(allele 1) + fraction 2*signal(allele 2) + fraction 3*signal(allele 3). In some embodiments, a principal component analysis (PCA) can be used to determine the linearity between the different types of molecules at the site. PCA is known in the art and has been described by Jolliffe, I. T., Principal Component Analysis, Series: Springer Series in Statistics, 2nd ed., Springer, N.Y., 2002, the disclosure of which is incorporated herein by reference in its entirety.
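The linear relationship described above can be illustrated with a minimal Python sketch. The function name aggregate_signal, the four-channel signal vectors and the 0.75/0.25 fractions are assumptions chosen for illustration only; they are not taken from the disclosure.

```python
def aggregate_signal(fractions, type_signals):
    """Return the expected aggregate signal under the linear model.

    fractions    -- one fraction per molecule type (assumed to sum to 1)
    type_signals -- one signal vector per molecule type, all of equal length
    """
    if len(fractions) != len(type_signals):
        raise ValueError("need exactly one fraction per molecule type")
    n_channels = len(type_signals[0])
    total = [0.0] * n_channels
    # Weighted sum: fraction 1*signal(allele 1) + fraction 2*signal(allele 2) + ...
    for fraction, signal in zip(fractions, type_signals):
        for k in range(n_channels):
            total[k] += fraction * signal[k]
    return total


# Hypothetical four-channel signals for two alleles at one site.
allele_1 = [1.0, 0.0, 0.0, 0.0]
allele_2 = [0.0, 0.0, 1.0, 0.0]
mixed = aggregate_signal([0.75, 0.25], [allele_1, allele_2])
```

Under this assumed model, the aggregate signal of the site is simply the fraction-weighted sum of the signals each allele would produce alone.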

The methods described herein can detect more than two types of molecules at a site in aggregate. In some embodiments, the site comprises at least three, four or five different types of molecules. In other embodiments, the site comprises at least six, seven or eight different types of molecules. In still other embodiments, the site comprises at least nine, ten, eleven, twelve, thirteen, fourteen, fifteen, or more different types of molecules. In some embodiments, two different types of molecules are detected in aggregate. In other embodiments, three, four or five different types of molecules are detected in aggregate. In still other embodiments, six, seven or eight different types of molecules are detected in aggregate. In yet other embodiments, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, or more different types of molecules are detected in aggregate. Additionally or alternatively, there can be a cap to the number of different types of molecules at a site including, for example, at most about fifteen, fourteen, thirteen, twelve, eleven, ten, nine, eight, seven, six, five, four, three, or two. The number of different types of molecules at a site or detected in an aggregate can be within a range defined by the values exemplified above or can be outside of these ranges depending upon the embodiment employed.

Different numbers of molecules can be associated with a site on a solid support. In some embodiments, about 2 to about 100,000 molecules are associated with the site. In other embodiments, about 10 to about 90,000, or about 100 to about 80,000, or about 500 to about 70,000, or about 600 to about 60,000, or about 600 to about 50,000, or about 700 to about 40,000, or about 800 to about 30,000, or about 900 to about 20,000, or about 1,000 to about 10,000, or about 2,000 to about 8,000, or about 3,000 to about 6,000, or about 4,000 to about 5,000 molecules are associated with the site.

In a preferred embodiment, the initial number of molecules associated with a site ranges from about 10 to about 1000 molecules. In another preferred embodiment, about 10 to about 500 molecules are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 molecules are initially associated with the site.

In some embodiments of the molecule detection methods described herein, a site includes a population of molecules ranging from about 2 to about 10¹¹ molecules, about 2 to about 10¹⁰ molecules, about 2 to about 10⁹ molecules, about 2 to about 10⁸ molecules, about 2 to about 10⁷ molecules, about 2 to about 10⁶ molecules, about 2 to about 10⁵ molecules or about 2 to about 10⁴ molecules. In other embodiments, a site includes about 10 to about 10¹¹ molecules, about 10 to about 10¹⁰ molecules, about 10 to about 10⁹ molecules, about 10 to about 10⁸ molecules, about 10 to about 10⁷ molecules, about 10 to about 10⁶ molecules, about 10 to about 10⁵ molecules or about 10 to about 10⁴ molecules. In still other embodiments, the site includes about 50 to about 10¹¹ molecules, about 50 to about 10¹⁰ molecules, about 50 to about 10⁹ molecules, about 50 to about 10⁸ molecules, about 50 to about 10⁷ molecules, about 50 to about 10⁶ molecules, about 50 to about 10⁵ molecules or about 50 to about 10⁴ molecules. In yet other embodiments, a site includes about 100 to about 10¹¹ molecules, about 100 to about 10¹⁰ molecules, about 100 to about 10⁹ molecules, about 100 to about 10⁸ molecules, about 100 to about 10⁷ molecules, about 100 to about 10⁶ molecules, about 100 to about 10⁵ molecules or about 100 to about 10⁴ molecules. In a preferred embodiment, the entire population of molecules present at a site is detected in aggregate. In another preferred embodiment, one or more portions of the population of molecules are detected in aggregate.

In some embodiments, detection can occur by detecting a signal produced by the molecules, such as a signal produced by a label attached directly to the molecule. In a preferred embodiment, the label is a fluorescent label. In other embodiments, detection can occur by detecting a signal produced by one or more labeled molecules hybridized to, or otherwise attached to, the molecules. In some embodiments, detection can occur by detecting a signal produced in close proximity to the molecules. For example, the signal may be a diffusible signal that is generated by the production of a particular detectable compound or by the production of a particular enzymatic substrate that is converted into a detectable compound. In some embodiments, the compound may be a diffusible compound.

In some embodiments of the methods described herein, a first estimate is made. In some embodiments, the first estimate is an estimate of the fraction of different types of molecules at the site. In other embodiments, the first estimate is an estimate of the amount of signal corresponding to different types of molecules at the site. In a preferred embodiment, the estimate of the fraction of different types of molecules at the site is a matrix of values, wherein each value corresponds to the estimated fraction of each type of molecule predicted or thought to be in the mixture of molecules associated with the site. In another preferred embodiment, the estimate of the amount of signal corresponding to different types of molecules at the site is a matrix of values, wherein each value corresponds to the estimated amount of combined signal generated by each type of molecule predicted or thought to be in the mixture of molecules associated with the site.

Once the first estimate is obtained, the variation associated with that estimate can be determined. Methods of determining the variation associated with an estimate are further exemplified below. The variation associated with the first estimate can then be used to calculate an estimate for either the fraction of different types of molecules at the site or the signal corresponding to different types of molecules at the site, whichever of these values was not approximated by the first estimate. In some embodiments, this calculated estimate is referred to as a second estimate. In a preferred embodiment, the second estimate is calculated using the least squares method.

In a preferred embodiment, the first and second estimates are refined by iteratively updating the estimates. For example, the second estimate is used to refine the first estimate. Subsequently, the refined first estimate is used to generate a refined second estimate. This refined second estimate can then be used to re-refine the first estimate, which can then be used to re-refine the second estimate. This iterative process can continue until the estimates converge on a solution set.
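The alternation between the two estimates can be sketched as follows. This is a minimal illustration, assuming exactly two molecule types, a linear signal model, and an ordinary least squares update in each half-step; the function names, the example data and the stopping tolerance are hypothetical and are not the update rules of the disclosure. Without additional constraints the factorization is not unique, so the sketch checks only that the estimates converge to a pair that reproduces the observed signals, not that the original fractions are recovered.

```python
def update_signals(observed, fractions):
    """Least squares signals (A, B) of the two types, given per-site fractions."""
    s_ff = sum(f * f for f in fractions)
    s_fg = sum(f * (1.0 - f) for f in fractions)
    s_gg = sum((1.0 - f) ** 2 for f in fractions)
    det = s_ff * s_gg - s_fg * s_fg
    if abs(det) < 1e-12:
        raise ValueError("fractions must differ between sites")
    sig_a, sig_b = [], []
    for k in range(len(observed[0])):
        r_f = sum(f * y[k] for f, y in zip(fractions, observed))
        r_g = sum((1.0 - f) * y[k] for f, y in zip(fractions, observed))
        # Solve the 2x2 normal equations for channel k.
        sig_a.append((s_gg * r_f - s_fg * r_g) / det)
        sig_b.append((s_ff * r_g - s_fg * r_f) / det)
    return sig_a, sig_b


def update_fractions(observed, sig_a, sig_b):
    """Least squares fraction of type A at each site, clamped to [0, 1]."""
    diff = [a - b for a, b in zip(sig_a, sig_b)]
    norm = sum(d * d for d in diff)
    if norm < 1e-12:
        raise ValueError("the two signal estimates are indistinguishable")
    fractions = []
    for y in observed:
        f = sum((yk - bk) * dk for yk, bk, dk in zip(y, sig_b, diff)) / norm
        fractions.append(min(1.0, max(0.0, f)))
    return fractions


def residual(observed, fractions, sig_a, sig_b):
    """Sum of squared differences between observed and modeled signals."""
    return sum(
        (yk - f * ak - (1.0 - f) * bk) ** 2
        for f, y in zip(fractions, observed)
        for yk, ak, bk in zip(y, sig_a, sig_b)
    )


def refine(observed, fractions, max_iter=500, tol=1e-10):
    """Alternate the two updates until the fraction estimate stops changing."""
    for iteration in range(1, max_iter + 1):
        sig_a, sig_b = update_signals(observed, fractions)
        new_fractions = update_fractions(observed, sig_a, sig_b)
        change = max(abs(n - o) for n, o in zip(new_fractions, fractions))
        fractions = new_fractions
        if change < tol:
            break
    return fractions, sig_a, sig_b, iteration


# Hypothetical noiseless data: four sites, four channels, two alleles.
true_fractions = [0.2, 0.5, 0.8, 0.4]
observed = [[f, 0.0, 1.0 - f, 0.0] for f in true_fractions]
initial_guess = [0.3, 0.5, 0.7, 0.45]  # first (fraction) estimate
fractions, sig_a, sig_b, iterations = refine(observed, initial_guess)
final_residual = residual(observed, fractions, sig_a, sig_b)
```

Each half-step holds one estimate fixed and solves for the other, which mirrors the refine and re-refine cycle described above.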

In some embodiments, the initial estimation is performed by guessing. For example, the estimating step can be performed by guessing the fraction of different types of molecules at the site or guessing the amount of signal corresponding to different types of molecules at the site. A guess can be random or can be based on the distribution of different types of molecules expected for a particular sample, for example based on statistical methods. In either case, the variation associated with the guess can then be determined. In some embodiments, the initial estimation is performed through random number generation. For example, the estimating step can be performed by picking a random number for the fraction of different types of molecules at the site or picking a random number for the amount of signal corresponding to different types of molecules at the site. In some other embodiments, the initial estimation is performed based on some knowledge about the particular system being analyzed or similar systems. In some embodiments, mathematical methods known in the art, such as a principal component analysis (PCA), can be used to improve the initial fraction estimate or the initial signal estimate.
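An initial fraction estimate obtained through random number generation could look like the following sketch. The function name random_fraction_guess and the normalization to a sum of 1 are assumptions made for illustration; the disclosure does not prescribe a particular generator or normalization.

```python
import random


def random_fraction_guess(n_types, rng):
    """Draw one random fraction per molecule type, normalized to sum to 1."""
    if n_types < 2:
        raise ValueError("a mixed site has at least two molecule types")
    # 1.0 - rng.random() lies in (0, 1], so the total is never zero.
    draws = [1.0 - rng.random() for _ in range(n_types)]
    total = sum(draws)
    return [d / total for d in draws]


# Seeded generator so that the hypothetical example is reproducible.
guess = random_fraction_guess(3, random.Random(0))
```

Such a guess only provides a starting point; the iterative updating described above is what refines it.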

Various searching methods known in the art can be used in the estimating step and updating step in the methods described herein. Non-limiting examples of searching methods include searching with iterated maps, tree-based searching, and stochastic searching methods. In some embodiments, searching methods based on iterating a mathematical mapping can be used in iteratively updating the fraction estimate and signal estimate until the estimates converge. For example, in such embodiments, the updating step can include performing a numerical optimization algorithm. In some embodiments, an iterative map search with Fienup update rules can be used in the updating step until the fraction and signal estimates converge.

In some embodiments, the iteration number is from about 5 to about 200,000 iterations. In other embodiments, the iteration number is from about 100 to about 100,000 iterations. In preferred embodiments, the number of iterations to reach convergence ranges from approximately 100 to 500 iterations. In some embodiments, more than 500 iterations may be required; however, as demonstrated in Example 1, convergence can be reached in fewer than 500 iterations. It will be appreciated that the iteration number typically increases with the number of types of molecules to be detected.

In some embodiments, molecules or a mixture of molecules are associated with a bead or particle. In some such embodiments, the bead can be a microbead, nanobead or picobead. In some embodiments, the beads or particles can have regular or irregular shapes. Furthermore, beads having a range of sizes and/or surface textures can be used. In some embodiments, the beads range in size from 5 to 500 microns in diameter. In other embodiments, the beads range from 5 to 5000 nanometers in diameter. In still other embodiments, the beads range from 5 to 5000 picometers in diameter. In some embodiments, the beads are solid. In other embodiments, the beads can be porous or hollow. In such embodiments, the molecules can be associated with the bead by “trapping” the molecules within the bead. In a preferred embodiment, molecules are associated with the surface of a solid bead.

In some embodiments, the solid support includes a flow chamber. In other embodiments, the solid support includes a flow-cell. In a preferred embodiment, molecules are associated with a bead which itself is associated with a surface of a flow-cell or flow-chamber. In some embodiments, the solid support includes a multiwell plate. In some such embodiments, the molecules can be provided directly in the wells. In other such embodiments, the molecules can be associated with beads, which are provided to the wells.

In some preferred embodiments, sequence data can be obtained for two or more types of molecules. In some embodiments, sequence data can be obtained for two, three, four, five, six, seven, eight, nine, ten, eleven, twelve, thirteen, fourteen, fifteen, sixteen, seventeen, eighteen, nineteen, twenty, or more types of molecules. In some such embodiments, the nucleotide sequence of a target region of a first type of molecule is compared to the nucleotide sequence of a variant of the target region that is present in a second type of molecule.

In some embodiments, where nucleic acids are used, sequence data is obtained by a nucleic acid sequencing process, such as sequencing-by-synthesis. In some such embodiments, the nucleotide sequence of a target region of a first type of molecule is compared to the nucleotide sequence of a variant of the target region that is present in a second type of molecule.

In some embodiments, the nucleic acids comprise a first subpopulation of nucleic acids and a second subpopulation of nucleic acids, wherein the nucleic acids of the first subpopulation each have an identical target region and the nucleic acids of the second subpopulation each have an identical region that is a variant of the target region. In some embodiments, the nucleotide sequence of the target region of the nucleic acids of the first subpopulation has at least 1, at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 15, at least 20, at least 25, at least 50, at least 75 or at least 100 nucleotide(s) that is/are different as compared to the nucleotide sequence of the variant of the target region. Alternatively or additionally, the nucleotide sequence of the target region of the nucleic acids of the first subpopulation has at most 1, at most 2, at most 3, at most 4, at most 5, at most 6, at most 7, at most 8, at most 9, at most 10, at most 15, at most 20, at most 25, at most 50, at most 75 or at most 100 nucleotide(s) that is/are different as compared to the nucleotide sequence of the variant of the target region.
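As a small illustration of counting nucleotide differences between a target region and its variant, the following sketch compares two sequences position by position. It assumes equal-length sequences that differ only by substitutions, so insertions and deletions are outside its scope; the function names and the example sequences are hypothetical.

```python
def count_differences(target, variant):
    """Count positions at which two equal-length sequences differ."""
    if len(target) != len(variant):
        raise ValueError("this sketch only handles equal-length sequences")
    return sum(1 for a, b in zip(target, variant) if a != b)


def within_bounds(target, variant, at_least=1, at_most=100):
    """Check that the number of differences lies in [at_least, at_most]."""
    return at_least <= count_differences(target, variant) <= at_most


# Hypothetical target region and variant differing by a single substitution.
n_diff = count_differences("ACGTAC", "ACGAAC")
```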

In some embodiments, a nucleotide sequence difference between the target region in the nucleic acids of the first subpopulation and the variant of the target region in the nucleic acids of the second subpopulation comprises at least one difference selected from the group consisting of a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, and a single nucleotide polymorphism (SNP).

In some embodiments, the nucleic acids comprise alleles of a genetic locus from a polyploid organism. In some other embodiments, the nucleic acids comprise alternative splicing forms of a nucleic acid. In yet other embodiments, the nucleic acids comprise alleles of a genetic locus from a diploid organism.

Identification of a Target Region of a Nucleic Acid

Also disclosed herein are methods that can be used for identifying a target region of a nucleic acid. In some embodiments, the methods include the steps of (a) associating a first subpopulation of nucleic acids with a site on a solid support, wherein nucleic acids of the first subpopulation comprise an identical target region; (b) associating a second subpopulation of nucleic acids with the site on the solid support, wherein nucleic acids of the second subpopulation comprise an identical target region that is a variant of the target region of the nucleic acids of the first subpopulation; (c) detecting a signal corresponding to one or more nucleotides of the target region of first subpopulation nucleic acids and one or more nucleotides of the variant of the target region of second subpopulation nucleic acids; (d) estimating the fraction of first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site or estimating the amount of signal corresponding to first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site; (e) calculating the amount of signal corresponding to first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site using the fraction estimate, or calculating the fraction of first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site using the signal estimate; and (f) iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby identifying a target region of a nucleic acid.

In some embodiments of the above-described methods, at least a first and second subpopulation of nucleic acids are associated with a site on a solid support. In a preferred embodiment, the first subpopulation nucleic acids and the second subpopulation nucleic acids are attached to the solid support at the site.

In some embodiments of the methods of identifying target regions of nucleic acids set forth herein, the target regions of the nucleic acids can be of the same or different lengths. In some embodiments, the target region can comprise at least 1 nucleotide, at least about 2 nucleotides, at least about 3 nucleotides, at least about 4 nucleotides, at least about 5 nucleotides, at least about 10 nucleotides, at least about 15 nucleotides, at least about 20 nucleotides, at least about 25 nucleotides, at least about 30 nucleotides, at least about 35 nucleotides, at least about 40 nucleotides, at least about 45 nucleotides, at least about 50 nucleotides, at least about 55 nucleotides, at least about 60 nucleotides, at least about 65 nucleotides, at least about 70 nucleotides, at least about 75 nucleotides, at least about 80 nucleotides, at least about 85 nucleotides, at least about 90 nucleotides, at least about 95 nucleotides, at least about 100 nucleotides, at least about 150 nucleotides, at least about 200 nucleotides, at least about 300 nucleotides, at least about 400 nucleotides, at least about 500 nucleotides, at least about 600 nucleotides, at least about 700 nucleotides, at least about 800 nucleotides, at least about 900 nucleotides, or at least about 1000 nucleotides. Alternatively or additionally, the target region can comprise at most 1 nucleotide, at most about 2 nucleotides, at most about 3 nucleotides, at most about 4 nucleotides, at most about 5 nucleotides, at most about 10 nucleotides, at most about 15 nucleotides, at most about 20 nucleotides, at most about 25 nucleotides, at most about 30 nucleotides, at most about 35 nucleotides, at most about 40 nucleotides, at most about 45 nucleotides, at most about 50 nucleotides, at most about 55 nucleotides, at most about 60 nucleotides, at most about 65 nucleotides, at most about 70 nucleotides, at most about 75 nucleotides, at most about 80 nucleotides, at most about 85 nucleotides, at most about 90 nucleotides, at most about 95 nucleotides, at most about 100 nucleotides, at most about 150 nucleotides, at most about 200 nucleotides, at most about 300 nucleotides, at most about 400 nucleotides, at most about 500 nucleotides, at most about 600 nucleotides, at most about 700 nucleotides, at most about 800 nucleotides, at most about 900 nucleotides, or at most about 1000 nucleotides.

In some embodiments of the methods of identifying target regions of nucleic acids, the estimating, determining, calculating and updating steps are performed as described above for molecule detection methods. In a preferred embodiment, estimating the fraction of first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site includes performing a principal component analysis (PCA) to improve the estimation. In some embodiments of the above-described methods, the initial estimating of the amount of signal corresponding to first subpopulation nucleic acids and/or second subpopulation nucleic acids associated with the site includes performing a principal component analysis (PCA) to improve the estimation.

As discussed above, once a first estimate is obtained, it can be used to calculate an estimate for either the fraction of first and second subpopulation nucleic acids at the site or the signal corresponding to the first and second subpopulation nucleic acids at the site, whichever of these values was not approximated by the first estimate. In some embodiments, this calculated estimate is referred to as a second estimate. In a preferred embodiment, the second estimate is calculated using the least squares method.

As also discussed above, the first and second estimates are refined by iteratively updating the estimates. For example, the second estimate is used to refine the first estimate. Subsequently, the refined first estimate is used to generate a refined second estimate. This refined second estimate can then be used to re-refine the first estimate, which can then be used to re-refine the second estimate. This iterative process can continue until the estimates converge.

In some embodiments of the methods of identifying target regions of nucleic acids, a numerical optimization algorithm is performed to iteratively update the fraction estimate and signal estimate until the estimates converge. In some such embodiments, the numerical optimization algorithm is based on iterative map search. In some other embodiments, the numerical optimization algorithm is based on Fienup's iteration map.

It will be appreciated that more than two subpopulations of nucleic acids can be associated with a site. In some embodiments, three subpopulations of nucleic acids, four subpopulations of nucleic acids, five subpopulations of nucleic acids, six subpopulations of nucleic acids, seven subpopulations of nucleic acids, eight subpopulations of nucleic acids, nine subpopulations of nucleic acids, ten subpopulations of nucleic acids, twenty subpopulations of nucleic acids, thirty subpopulations of nucleic acids, forty subpopulations of nucleic acids, fifty subpopulations of nucleic acids or more than fifty subpopulations of nucleic acids can be associated with a site on a solid support.

Various numbers of nucleic acids can be associated with the site; for example, about 1 to about 100,000 nucleic acids can be associated with the site. In some embodiments, about 1,000 to about 10,000 nucleic acids are associated with the site. In some other embodiments, about 2,000 to about 8,000 nucleic acids are associated with the site. In still some other embodiments, about 3,000 to about 6,000 nucleic acids are associated with the site. In yet some other embodiments, about 4,000 to about 5,000 nucleic acids are associated with the site.

In a preferred embodiment, the initial number of nucleic acids associated with a site ranges from about 10 to about 1000 nucleic acids. In another preferred embodiment, about 10 to about 500 nucleic acids are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 nucleic acids are initially associated with the site.

In some embodiments of the methods of identifying a target region of a nucleic acid described herein, a site includes about 2 to about 10¹¹ nucleic acids, about 2 to about 10¹⁰ nucleic acids, about 2 to about 10⁹ nucleic acids, about 2 to about 10⁸ nucleic acids, about 2 to about 10⁷ nucleic acids, about 2 to about 10⁶ nucleic acids, about 2 to about 10⁵ nucleic acids or about 2 to about 10⁴ nucleic acids. In other embodiments, a site includes about 10 to about 10¹¹ nucleic acids, about 10 to about 10¹⁰ nucleic acids, about 10 to about 10⁹ nucleic acids, about 10 to about 10⁸ nucleic acids, about 10 to about 10⁷ nucleic acids, about 10 to about 10⁶ nucleic acids, about 10 to about 10⁵ nucleic acids or about 10 to about 10⁴ nucleic acids. In still other embodiments, the site includes about 50 to about 10¹¹ nucleic acids, about 50 to about 10¹⁰ nucleic acids, about 50 to about 10⁹ nucleic acids, about 50 to about 10⁸ nucleic acids, about 50 to about 10⁷ nucleic acids, about 50 to about 10⁶ nucleic acids, about 50 to about 10⁵ nucleic acids or about 50 to about 10⁴ nucleic acids. In yet other embodiments, a site includes about 100 to about 10¹¹ nucleic acids, about 100 to about 10¹⁰ nucleic acids, about 100 to about 10⁹ nucleic acids, about 100 to about 10⁸ nucleic acids, about 100 to about 10⁷ nucleic acids, about 100 to about 10⁶ nucleic acids, about 100 to about 10⁵ nucleic acids or about 100 to about 10⁴ nucleic acids. In any of the above-described embodiments of the methods described herein, the nucleic acids present at a site can be detected in aggregate.

In some embodiments, the nucleotide sequence of the target region of first subpopulation nucleic acids has at least 1 nucleotide that is different as compared to the nucleotide sequence of the variant of the target region of second subpopulation nucleic acids. In some other embodiments, the nucleotide sequence of the target region of first subpopulation nucleic acids has at least 2, at least 3, at least 4, at least 5, at least 6, at least 7, at least 8, at least 9, at least 10, at least 11, at least 12, at least 13, at least 14, at least 15, at least 16, at least 17, at least 18, at least 19, at least 20, at least 25, at least 30, at least 35, at least 40, at least 45, at least 50, at least 55, at least 60, at least 65, at least 70, at least 75, at least 80, at least 85, at least 90, at least 95 or at least 100 nucleotides that are different as compared to the nucleotide sequence of the variant of the target region of second subpopulation nucleic acids. Alternatively or additionally, the nucleotide sequence of the target region of first subpopulation nucleic acids has at most 1, at most 2, at most 3, at most 4, at most 5, at most 6, at most 7, at most 8, at most 9, at most 10, at most 11, at most 12, at most 13, at most 14, at most 15, at most 16, at most 17, at most 18, at most 19, at most 20, at most 25, at most 30, at most 35, at most 40, at most 45, at most 50, at most 55, at most 60, at most 65, at most 70, at most 75, at most 80, at most 85, at most 90, at most 95 or at most 100 nucleotides that are different as compared to the nucleotide sequence of the variant of the target region of second subpopulation nucleic acids.

In some embodiments of the above-described methods, a nucleotide sequence difference between the target region in first subpopulation nucleic acids and the variant of the target region in second subpopulation nucleic acids comprises at least one difference selected from the group consisting of a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, and a single nucleotide polymorphism (SNP). In some embodiments of the above-described methods, a nucleotide sequence difference between the target region in first subpopulation nucleic acids and the variant of the target region in second subpopulation nucleic acids comprises at least two, three, four, five, six, seven, eight, nine, ten, or more differences selected from the group consisting of a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, and a single nucleotide polymorphism (SNP). Alternatively or additionally, a nucleotide sequence difference between the target region in first subpopulation nucleic acids and the variant of the target region in second subpopulation nucleic acids comprises at most one, two, three, four, five, six, seven, eight, nine, ten, or more differences selected from the group consisting of a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, and a single nucleotide polymorphism (SNP).

In some embodiments, first subpopulation nucleic acids and second subpopulation nucleic acids comprise alleles of a genetic locus from a polyploid organism. In some other embodiments, first subpopulation nucleic acids and second subpopulation nucleic acids comprise alternative splicing forms of a nucleic acid. In still some other embodiments, first subpopulation nucleic acids and second subpopulation nucleic acids comprise alleles of a genetic locus from a diploid organism.

In some embodiments, sequence data is obtained for one subpopulation of nucleic acids or both the first and second subpopulations of nucleic acids. In some such embodiments, sequence data is obtained by a sequencing-by-synthesis process, for example a pyrosequencing process. In other embodiments, the sequence data is obtained from a sequencing by ligation process. In still other embodiments, the sequence data is obtained from other sequencing processes known in the art. Various methods of sequencing nucleic acids are described further below.

Sequencing Methods

Embodiments of the methods and compositions disclosed herein relate to nucleic acid (polynucleotide) sequencing. In some methods and compositions described herein, the nucleotide sequence of a portion of a target nucleic acid or fragment thereof can be determined using a variety of methods and devices. Examples of sequencing methods include electrophoretic, sequencing by synthesis, sequencing by ligation, sequencing by hybridization, single-molecule sequencing, and real time sequencing methods. In some embodiments, the process to determine the nucleotide sequence of a target nucleic acid or fragment thereof can be an automated process. In some embodiments, capture probes can function as primers permitting the priming of a nucleotide synthesis reaction using a polynucleotide from the nucleic acid sample as a template. In this way, information regarding the sequence of the polynucleotides supplied to the array can be obtained. In some embodiments, polynucleotides hybridized to capture probes on the array can serve as sequencing templates if primers that hybridize to the polynucleotides bound to the capture probes and sequencing reagents are further supplied to the array. Methods of sequencing using arrays have been described previously in the art.

Electrophoretic sequencing methods include Sanger sequencing protocols and conventional electrophoretic techniques (Sanger, F., Nicklen, S. and Coulson, A. R. (1977) DNA sequencing with chain-terminating inhibitors. Proc. Natl. Acad. Sci. USA. 74(12), 5463-7; Swerdlow, H., Wu, S. L., Harke, H. & Dovichi, N. J. Capillary gel electrophoresis for DNA sequencing. Laser-induced fluorescence detection with the sheath flow cuvette. J. Chromatogr. 516, 61-67 (1990); Hunkapiller, T., Kaiser, R. J., Koop, B. F. & Hood, L. Large-scale and automated DNA sequence determination. Science 254, 59-67 (1991)). In such embodiments, electrophoresis can be carried out on a microfabricated device (Paegel, B. M., Blazej, R. G. & Mathies, R. A. Microfluidic devices for DNA sequencing: sample preparation and electrophoretic analysis. Curr. Opin. Biotechnol. 14, 42-50 (2003); Hong, J. W. & Quake, S. R. Integrated nanoliter systems. Nat. Biotechnol. 21, 1179-1183 (2003), the disclosures of which are incorporated herein by reference in their entireties).

In some embodiments described herein, nucleic acid sequencing is performed. Such sequencing can include, but is not limited to, sequencing-by-synthesis (SBS). In SBS, fluorescently labeled modified nucleotides are used to determine the sequence of nucleotides for nucleic acids present on the surface of a support structure such as a flowcell. Exemplary SBS systems and methods which can be utilized with the apparatus and methods set forth herein are described in US Patent Application Publication No. 2007/0166705, US Patent Application Publication No. 2006/0188901, U.S. Pat. No. 7,057,026, US Patent Application Publication No. 2006/0240439, US Patent Application Publication No. 2006/0281109, PCT Publication No. WO 05/065814, US Patent Application Publication No. 2005/0100900, PCT Publication No. WO 06/064199 and PCT Publication No. WO 07/010251, each of which is incorporated herein by reference in its entirety.

With respect to other uses of the methods and compositions described herein, arrayed nucleic acids are treated by several repeated cycles of an overall sequencing process. In some embodiments, the attached nucleic acids are prepared such that they include an oligonucleotide primer (capture probe) hybridized to an unknown target sequence or hybridized to another template nucleic acid or polynucleotide whether the sequence identity is known or unknown. To initiate the first SBS sequencing cycle, one or more differently labeled nucleotides and a DNA polymerase can be introduced to the array. Either a single nucleotide can be added at a time, or the nucleotides used in the sequencing procedure can be specially designed to possess a reversible termination property, thus allowing each cycle of the sequencing reaction to occur simultaneously in the presence of all four labeled nucleotides (A, C, T, G). Following nucleotide addition, signals produced at the features on the surface can be detected to determine the identity of the incorporated nucleotide (based on the labels on the nucleotides). Reagents can then be added to remove the blocked 3′ terminus (if appropriate) and to remove labels from each incorporated base. Reagents, enzymes and other substances can be removed between steps by washing. Such cycles are then repeated and the sequence of each cluster is read over the multiple chemistry cycles.

Preferred embodiments include sequencing by synthesis (SBS) techniques. SBS techniques generally involve the enzymatic extension of a nascent nucleic acid strand through the iterative addition of nucleotides against a template strand. Each nucleotide addition queries one or a few bases of the template strand. In one exemplary type of SBS, cycle sequencing is accomplished by stepwise addition of reversible terminator nucleotides containing, for example, a cleavable or photobleachable dye label. This approach is being commercialized by Solexa (now Illumina), and is also described in WO 91/06678, which is incorporated herein by reference in its entirety. The availability of fluorescently-labeled terminators in which both the termination can be reversed and the fluorescent label cleaved is important to facilitating efficient cyclic reversible termination (CRT) sequencing. Polymerases can also be co-engineered to efficiently incorporate and extend from these modified nucleotides. In particular embodiments, reversible terminators/cleavable fluors can include fluor linked to the ribose moiety via a 3′ ester linkage (Metzker, Genome Res. 15:1767-1776 (2005), which is incorporated herein by reference). Other approaches have separated the terminator chemistry from the cleavage of the fluorescence label (Ruparel et al., Proc Natl Acad Sci USA 102: 5932-7 (2005), which is incorporated herein by reference in its entirety). Ruparel et al. described the development of reversible terminators that used a small 3′ allyl group to block extension, but could easily be deblocked by a short treatment with a palladium catalyst. The fluorophore was attached to the base via a photocleavable linker that could easily be cleaved by a 30 second exposure to long wavelength UV light. Thus, either disulfide reduction or photocleavage can be used as a cleavable linker. Another approach to reversible termination is the use of natural termination that ensues after placement of a bulky dye on a dNTP. The presence of a charged bulky dye on the dNTP can act as an effective terminator through steric and/or electrostatic hindrance. The presence of one incorporation event prevents further incorporations unless the dye is removed. Cleavage of the dye removes the fluor and effectively reverses the termination. Examples of modified nucleotides are also described in U.S. Pat. No. 7,427,673, and U.S. Pat. No. 7,057,026, the disclosures of which are incorporated herein by reference in their entireties.

In certain preferred embodiments, sequencing is performed by employing one or more versions of sequencing-by-synthesis (SBS). SBS is a process in which one or more nucleotides or oligonucleotides are sequentially added to an extending polynucleotide chain in the 5′ to 3′ direction to form an extended polynucleotide complementary to the template nucleic acid to be sequenced. The identity of the base present in one or more of the added nucleotide(s) can be determined in a detection or imaging step, preferably after each nucleotide incorporation. In various embodiments involving SBS, fluorescently labeled nucleotides are used in the sequencing reaction. The four different bases are each labeled with a unique fluorescent label to permit identification of the incorporated nucleotide as successive nucleotides are added. The labeled nucleoside triphosphates also can have a removable 3′ blocking group to prevent further incorporation. The label of the incorporated base can be determined and the blocking group removed to permit further extension.
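
By way of non-limiting illustration, the per-cycle identification of an incorporated base from four uniquely labeled nucleotides can be sketched as follows. The sketch assumes one intensity channel per base and one row of intensities per cycle; the channel order, the function name `call_bases` and the example values are hypothetical and are not taken from this disclosure. Picking the strongest channel reports only the majority base when a site carries more than one type of template.

```python
# Hypothetical illustration: one fluorescence channel per base, one row per cycle.
CHANNELS = "ACGT"  # assumed channel order


def call_bases(intensities):
    """Return the base whose channel is strongest in each cycle."""
    calls = []
    for cycle in intensities:
        if len(cycle) != len(CHANNELS):
            raise ValueError("expected one intensity per channel")
        # Index of the brightest channel in this cycle.
        best = max(range(len(CHANNELS)), key=lambda i: cycle[i])
        calls.append(CHANNELS[best])
    return "".join(calls)


# Made-up intensities for three cycles (rows) and four channels (A, C, G, T).
example = [
    [0.9, 0.1, 0.0, 0.1],
    [0.1, 0.0, 0.2, 0.8],
    [0.0, 0.7, 0.1, 0.1],
]
read = call_bases(example)
```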

The labels may be the same for each type of nucleotide, or each nucleotide type may carry a different label. This facilitates the identification of incorporation of a particular nucleotide. Thus, for example, modified adenine, guanine, cytosine and thymine would each have a different fluorophore attached to allow them to be discriminated from one another readily. When sequencing on arrays, a mixture of labeled and unlabelled nucleotides may be used. Detectable labels such as fluorophores can be linked to nucleotides via the base using a suitable linker. The linker may be acid labile, photolabile or contain a disulfide linkage. Preferred labels and linkages include those disclosed in U.S. Pat. No. 7,057,026. Other linkages, in particular phosphine-cleavable azide-containing linkers, may be employed in the invention as described in greater detail in US 2006/0160081. The contents of U.S. Pat. No. 7,057,026 and US 2006/0160081 are incorporated herein by reference.

Methods for detecting fluorescently labeled nucleotides generally use incident light (e.g. laser light) of a wavelength specific for the fluorescent label, or the use of other suitable sources of illumination, to excite the fluorophore. Fluorescent light emitted from the fluorophore may then be detected at the appropriate wavelength using a suitable detection system such as, for example, a Charge-Coupled-Device (CCD) camera, which can optionally be coupled to a magnifying device, a fluorescent imager or a confocal microscope. In embodiments involving sequencing carried out on an array, detection of an incorporated base may be performed by using a scanning microscope to scan the surface of the array with a laser and image fluorescent labels attached to the incorporated nucleotide(s). A sensitive 2-D detector, such as a charge-coupled detector (CCD), can be used to visualize the signals generated.

Other sequencing methods that use cyclic reactions can be used, such as those wherein each cycle can include steps of delivering one or more reagents to nucleic acids, for example, sequencing-by-synthesis and sequencing-by-ligation. Useful pyrosequencing reactions are described, for example, in US Patent Application Publication No. 2005/0191698 and U.S. Pat. No. 7,244,559, each of which is incorporated herein by reference. Sequencing-by-ligation reactions are described, for example, in Shendure et al. Science 309:1728-1732 (2005); U.S. Pat. No. 5,599,675; and U.S. Pat. No. 5,750,341, each of which is incorporated herein by reference in its entirety.

Several embodiments include pyrosequencing techniques. Pyrosequencing detects the release of inorganic pyrophosphate (PPi) as particular nucleotides are incorporated into the nascent strand (Ronaghi, M., Karamohamed, S., Pettersson, B., Uhlen, M. and Nyren, P. (1996) “Real-time DNA sequencing using detection of pyrophosphate release.” Analytical Biochemistry 242(1), 84-9; Ronaghi, M. (2001) “Pyrosequencing sheds light on DNA sequencing.” Genome Res. 11(1), 3-11; Ronaghi, M., Uhlen, M. and Nyren, P. (1998) “A sequencing method based on real-time pyrophosphate.” Science 281(5375), 363; U.S. Pat. No. 6,210,891; U.S. Pat. No. 6,258,568 and U.S. Pat. No. 6,274,320, the disclosures of which are incorporated herein by reference in their entireties). In pyrosequencing, released PPi can be detected by being immediately converted to adenosine triphosphate (ATP) by ATP sulfurylase, and the level of ATP generated is detected via luciferase-produced photons.
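
By way of non-limiting illustration, decoding of a pyrosequencing light trace can be sketched as follows. The sketch assumes that nucleotides are dispensed one at a time in a fixed, repeating order and that the light detected in each dispensation is roughly proportional to the number of nucleotides incorporated; the dispensation order, the `unit` calibration value and the function name `decode_flowgram` are hypothetical and are not taken from this disclosure.

```python
def decode_flowgram(flow_order, signals, unit=1.0):
    """Turn per-dispensation light signals into a base sequence.

    flow_order: repeating order in which nucleotides are dispensed (assumed).
    signals:    one light measurement per dispensation.
    unit:       assumed signal produced by a single incorporation.
    """
    sequence = []
    for i, signal in enumerate(signals):
        base = flow_order[i % len(flow_order)]
        # Round to the nearest whole number of incorporations.
        count = int(round(signal / unit))
        sequence.append(base * count)
    return "".join(sequence)


# Made-up signals for eight dispensations in the order T, A, C, G, T, A, C, G.
decoded = decode_flowgram("TACG", [1.02, 0.05, 2.1, 0.0, 0.0, 0.97, 0.1, 1.1])
```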

Some embodiments include methods utilizing sequencing by hybridization techniques. In such embodiments, differential hybridization of oligonucleotide probes can be used to decode a target DNA sequence (Bains, W. and Smith, G. C. A novel method for nucleic acid sequence determination. Journal of Theoretical Biology 135(3), 303-7 (1988); Drmanac, S. et al., Accurate sequencing by hybridization for DNA diagnostics and individual genomics. Nature Biotechnology 16, 54-58 (1998); Fodor, S. P. A., Read, J. L., Pirrung, M. C., Stryer, L., Lu, A. T. and Solas, D. Light-directed, spatially addressable parallel chemical synthesis. Science 251(4995), 767-773 (1991); Southern, E. M. (1989) Analyzing polynucleotide sequences. WO 1989/10977, the disclosures of which are incorporated herein by reference in their entireties). The target DNA can be immobilized on a solid support and serial hybridizations can be performed with short probe oligonucleotides, for example, oligonucleotides 5 to 8 nucleotides in length. The extent to which specific probes bind to the target DNA can be used to infer the unknown sequence. Target DNA can also be hybridized to high density oligonucleotide arrays (Lipshutz, R. J. et al., (1995) Using oligonucleotide probe arrays to access genetic diversity. Biotechniques 19, 442-447, the disclosure of which is incorporated herein by reference in its entirety).
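
By way of non-limiting illustration, inferring a sequence from the set of short probes that bind a target can be sketched as a chain of overlapping probes. The sketch assumes error-free binding, probes of equal length, a known starting probe and a target in which every overlap is unique; the function name `reconstruct` and the example probes are hypothetical and are not taken from this disclosure.

```python
def reconstruct(spectrum, start):
    """Chain equal-length probes that overlap by all but one base."""
    remaining = set(spectrum)
    remaining.discard(start)
    sequence = start
    overlap = len(start) - 1
    while remaining:
        suffix = sequence[-overlap:]
        candidates = [p for p in remaining if p.startswith(suffix)]
        if len(candidates) != 1:
            break  # dead end or ambiguous extension; stop rather than guess
        sequence += candidates[0][-1]
        remaining.remove(candidates[0])
    return sequence


# Made-up set of 5-mer probes that bound the target.
bound_probes = {"ATGCG", "TGCGT", "GCGTA", "CGTAC"}
target = reconstruct(bound_probes, "ATGCG")
```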

Some embodiments can utilize nanopore sequencing (Deamer and Akeson, 2000; Deamer and Branton, 2002; Li et al., 2003, the disclosures of which are incorporated herein by reference in their entireties). In such embodiments, the target nucleic acid or nucleotides exonucleolytically removed from the target nucleic acid pass through a nanopore. The nanopore can be a synthetic pore or biological membrane protein, such as α-hemolysin (Deamer, D. W. & Akeson, M. Nanopores and nucleic acids: prospects for ultrarapid sequencing. Trends Biotechnol. 18, 147-151 (2000), the disclosure of which is incorporated herein by reference in its entirety). As the target nucleic acid or nucleotides derived therefrom pass through the nanopore, each type of base can be identified by measuring fluctuations in the electrical conductance of the pore (U.S. Pat. No. 7,001,792; Soni, G. V. & Meller, A. Progress toward ultrafast DNA sequencing using solid-state nanopores. Clin. Chem. 53, 1996-2001 (2007); Healy, K. Nanopore-based single-molecule DNA analysis. Nanomed. 2, 459-481 (2007); Cockroft, S. L., Chu, J., Amorin, M. & Ghadiri, M. R. A single-molecule nanopore device detects DNA polymerase activity with single-nucleotide resolution. J. Am. Chem. Soc. 130, 818-820 (2008); Levene, M. J. et al. Zero-mode waveguides for single-molecule analysis at high concentrations. Science 299, 682-686 (2003), the disclosures of which are incorporated herein by reference in their entireties).

Some embodiments can utilize methods involving the real-time monitoring of DNA polymerase activity. Nucleotide incorporations can be detected through fluorescence resonance energy transfer (FRET) interactions between a fluorophore-bearing polymerase and γ-phosphate-labeled nucleotides, or with zero-mode waveguides. The illumination can be restricted to a zeptoliter-scale volume around a surface-tethered polymerase such that incorporation of fluorescently labeled nucleotides can be observed with low background (Levene, M. J. et al. Zero-mode waveguides for single-molecule analysis at high concentrations. Science 299, 682-686 (2003); Lundquist, P. M. et al. Parallel confocal detection of single molecules in real time. Opt. Lett. 33, 1026-1028 (2008); Korlach, J. et al. Selective aluminum passivation for targeted immobilization of single DNA polymerase molecules in zero-mode waveguide nanostructures. Proc. Natl. Acad. Sci. USA 105, 1176-1181 (2008), the disclosures of which are incorporated herein by reference in their entireties).

The above SBS methods can be advantageously carried out in multiplex formats such that multiple different target nucleic acids are manipulated simultaneously. In particular embodiments, different target nucleic acids can be treated in a common reaction vessel or on a surface of a particular substrate. This allows convenient delivery of sequencing reagents, removal of unreacted reagents and detection of incorporation events in a multiplex manner. In embodiments using surface-bound target nucleic acids, the target nucleic acids can be in an array format. In an array format, the target nucleic acids are typically bound to a surface in a spatially distinguishable manner. The target nucleic acids can be bound by direct covalent attachment, attachment to a bead or other particle, or binding to a polymerase or other molecule that is attached to the surface. The array can include a single copy of a target nucleic acid at each site (also referred to as a feature) or multiple copies having the same sequence can be present at each site or feature. Multiple copies can be produced by amplification methods such as bridge amplification or emulsion PCR as described in further detail below.

In embodiments involving sequencing on a substrate such as an array, paired end reads may be obtained on nucleic acid clusters. Methods for obtaining paired end reads are described in WO 07/010252 and WO 07/091077, each of which is incorporated herein by reference. Paired end sequencing facilitates reading both the forward and reverse template strands of each cluster during one paired-end read. Generally, template clusters are amplified on the surface of a substrate (e.g. a flow-cell) by bridge amplification and sequenced by paired primers sequentially. Upon amplification of the template strands, a bridged double stranded structure is produced. This can be treated to release a portion of one of the strands of each duplex from the surface. The single stranded nucleic acid is available for sequencing, primer hybridization and cycles of primer extension. After the first sequencing run, the ends of the first single stranded template can be hybridized to the immobilized primers remaining from the initial cluster amplification procedure. The immobilized primers can be extended using the hybridized first single strand as a template to resynthesize the original double stranded structure. The double stranded structure can be treated to remove at least a portion of the first template strand to leave the resynthesized strand immobilized in single stranded form. The resynthesized strand can be sequenced to determine a second read, which originates from the opposite end of the original template fragment obtained from the fragmentation process.
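
By way of non-limiting illustration, the relationship between the two reads of a pair can be sketched as follows. The sketch assumes that the first read is taken from one end of the fragment and that the second read is taken from the resynthesized complementary strand, so that it corresponds to the reverse complement of the opposite end; the function names and the example fragment are hypothetical and are not taken from this disclosure.

```python
# Translation table mapping each base to its complement.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(sequence):
    """Return the complementary strand read in the 5' to 3' direction."""
    return sequence.translate(COMPLEMENT)[::-1]


def paired_end_reads(fragment, read_length):
    """Return (first read, second read) for one template fragment."""
    first = fragment[:read_length]
    second = reverse_complement(fragment)[:read_length]
    return first, second


# Made-up fragment; each read covers three bases from opposite ends.
first_read, second_read = paired_end_reads("AACGTTGC", 3)
```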

It will be appreciated that any of the above-described sequencing processes can be incorporated into the methods and/or compositions described herein. Furthermore, it will be appreciated that other known sequencing processes can be easily implemented for use with the methods and/or compositions described herein.

Molecule Detection Systems

Also disclosed herein are molecule detection systems. The systems can include a solid support comprising molecules associated with a site on the solid support such that the molecules are detected in aggregate, wherein the molecules comprise at least two different types of molecules, and a detector configured to detect the molecules associated with the site.

As will be appreciated by those in the art, the number of possible solid supports is very large. Possible solid supports include, but are not limited to, glass and modified or functionalized glass, plastics (including acrylics, polystyrene and copolymers of styrene and other materials, polypropylene, polyethylene, polybutylene, polyurethanes, Teflon™, etc.), polysaccharides, nylon or nitrocellulose, resins, silica or silica-based materials including silicon and modified silicon, carbon, metals, inorganic glasses, plastics, optical fiber bundles, and a variety of other polymers. In some embodiments, the solid supports allow optical detection and do not themselves appreciably fluoresce.

In some embodiments, the configuration of the solid support is flat (planar), although as will be appreciated by those in the art, other configurations of solid supports may be used as well. For example, three dimensional configurations can be used. In some embodiments, the solid support can be hollow or porous. For example, beads having a mixture of molecules attached thereto can be embedded in a porous block of plastic that allows sample access to the beads and detection using a confocal microscope. Similarly, beads may be placed on the inside surface of a tube, on the inner surface of a flow cell or on one of the lanes of a multi-lane flow chamber. In some such embodiments, the above configurations permit flow-through sample analysis and/or reduce the necessary sample volume. In other embodiments, the solid support comprises wells, such as microwells.

In some embodiments, fiber optic bundles can be used as substrates. In such embodiments, a mixture of molecules is associated with one end of the bundle. In particular, the mixture of molecules can be associated with a fiber end, such as by direct chemical attachment, indirect attachment or other attachments or retention mechanisms. In some embodiments, one or more of the fibers have a well etched into an end, for example, as described in U.S. Pat. No. 7,622,294, the disclosure of which is incorporated herein by reference in its entirety. In such embodiments, a mixture of molecules can be attached at or within the well. In other embodiments, a mixture of molecules can be attached to a bead or other microparticle, which is then provided to the well. In some embodiments, the bead is sized to be slightly smaller than the well. In other embodiments, multiple beads are included in the well. The beads may be the same or different in size, shape, texture and/or the ability to generate a particular signal.

In some embodiments, silicon wafer solid supports are used. In some embodiments, the silicon may be doped as known in the art. In some embodiments, the substrate is in the shape of or is a microscope slide.

It will be appreciated that when beads or other particles are used as a solid support, such substrates can be used alone, used in groups of similar or different beads or particles, or used in connection with a second solid support. Further particular embodiments of bead substrates are described below.

In a preferred embodiment, the methods and compositions described herein utilize a robotic system. Many systems are generally directed to the use of 96 (or more) well microtiter plates, but as will be appreciated by those in the art, any number of different plates or configurations of solid support may be used. In addition, any or all of the steps outlined herein may be automated. Thus, for example, the systems may be completely or partially automated.

As will be appreciated by those in the art, there are a wide variety of components, one or more of which can be used, including but not limited to, one or more robotic arms; plate handlers for the positioning of microtiter plates or other solid supports; automated lid handlers to remove and replace lids to cover wells on non-cross contamination plates; a fluid handling device, such as one that includes tip assemblies for sample distribution; well loading blocks; cooled reagent racks; microtiter plate pipette positions (optionally cooled); stacking towers for plates and tips; one or more detectors for detecting signals; and computer systems.

Robotic systems can include automated fluid handling and/or particle-handling devices, including high throughput pipetting to perform steps involved in fluid dispensation, dispersion and/or removal. This includes liquid and particle manipulations such as aspiration, dispensing, mixing, diluting, washing and accurate volumetric transfers; retrieving and discarding of pipet tips; and repetitive pipetting of identical volumes for multiple deliveries from a single sample aspiration. In some embodiments, use of such systems can result in cross-contamination-free liquid and particle transfers.

In a preferred embodiment, platforms for multi-well plates, multi-tubes, minitubes, deep-well plates, microfuge tubes, cryovials, square well plates, filters, chips, optic fibers, beads, and other solid-phase matrices or platforms with various volumes are accommodated on an upgradable modular platform for additional capacity. This modular platform can include a variable speed orbital shaker, and multi-position work decks for source samples, sample and reagent dilution, assay plates, sample and reagent reservoirs, pipette tips, and an active wash station.

In a preferred embodiment, interchangeable pipet heads (single or multi-channel) with single or multiple magnetic probes, affinity probes, or pipetters robotically manipulate the liquid and particles. Multi-well or multi-tube magnetic separators or platforms manipulate liquid and particles in single or multiple sample formats.

In some preferred embodiments, the instrumentation will include a detection system or detector. In some embodiments, the detection system provides a light source, or other energy source, configured to provide excitation energy to one or more sites on the solid support. In some embodiments, a CCD camera, CMOS or other signal detecting device is used to capture and detect, or otherwise record, signal data. In some embodiments, such data can be transformed into images or into other quantifiable formats using a computer. In some embodiments, one or more computers are involved in further displaying and/or analyzing the data. As will be discussed more fully below, in some embodiments, a computer or system of computers is used to analyze aggregate signal data in accordance with the methods for signal deconvolution described herein.

The flexible hardware and software configurations allow instrument adaptability for multiple applications. For example, in some embodiments, data processing modules or program modules allow creation, modification, and running of methods. In some embodiments, the system includes diagnostic modules that allow instrument alignment, correct connections, and refined motor operations. In addition, configurable tools, labware, and liquid and particle transfer patterns allow different applications to be performed. In some embodiments, robotic and computer interfaces allow communication between instruments.

In further embodiments, the molecule detection systems described herein include one or more databases that allow method and parameter storage. In some embodiments, signal data may be stored in and retrieved from databases. In other embodiments, reference biosignatures can be stored in a database.

In some embodiments, the robotic workstation includes one or more heating or cooling components. Depending on the reactions and reagents, either cooling or heating may be required, which can be done using any number of known heating and cooling systems, including Peltier systems.

In a preferred embodiment, thermocycler and thermoregulating systems are used for stabilizing the temperature of the heat exchangers such as controlled blocks or platforms to provide accurate temperature control of incubating samples from 4° C. to 100° C.

In a preferred embodiment, the robotic apparatus includes a central processing unit (CPU) which communicates with a memory and a set of input/output devices (e.g., keyboard, mouse, monitor, printer, etc.) through a bus. The general interaction between a central processing unit, a memory, input/output devices, and a bus is known in the art. Thus, a variety of different procedures, depending on the experiments to be run, are stored in memory.

It will be appreciated that the molecule detection systems described herein need not be fully robotic. Rather, in some embodiments, only portions or parts of a robotic or automated system can be employed in the molecule detection systems described herein.

In some embodiments, the methods for aggregate signal deconvolution, such as those described in connection with the detection and/or identification methods set forth herein, can be implemented using one or more data processing modules or program modules. For example, a first data processing module can be configured to estimate the fraction of different types of molecules at the site or the amount of signal corresponding to different types of molecules at the site. In some embodiments, the same or a different module can include instructions for determining the variation associated with one or both of the estimates.

In some embodiments, the systems can further include a second, or another, data processing module configured to utilize the variation associated with the estimate of one of the above-mentioned parameters to obtain an estimate for the other parameter. For example, the initial estimate for the fraction of different types of molecules at the site and the variation of that estimate can be used to calculate an estimate for the amount of signal corresponding to different types of molecules at the site. In some embodiments, this second data processing module can be combined with the first data processing or program module.

In other embodiments of the molecule detection systems described herein, the systems can further include a third, or another, data processing module configured to iteratively update the fraction estimate and signal estimate. In a preferred embodiment, the estimates are updated until they converge at or near a solution set.
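
By way of non-limiting illustration, an iterative update of a fraction estimate and a signal estimate can be sketched for a site carrying two types of template. The sketch is an assumption-laden simplification and is not the specific algorithm of this disclosure: it assumes noise-free, normalized four-channel intensities, exactly two template types present at fractions f and 1 - f, an exhaustive search over base pairs for the signal step, and a closed-form least-squares fit for the fraction step. All function names, the initial value `f0` and the example data are hypothetical.

```python
from itertools import product

BASES = "ACGT"  # assumed channel order


def predict(f, b1, b2):
    """Expected channel signal for two templates mixed at f : (1 - f)."""
    return [f * (c == b1) + (1.0 - f) * (c == b2) for c in BASES]


def update_signal(observed, f):
    """Signal step: per cycle, pick the base pair that best explains the data."""
    calls = []
    for y in observed:
        best = min(
            product(BASES, repeat=2),
            key=lambda pair: sum(
                (yc - pc) ** 2 for yc, pc in zip(y, predict(f, *pair))
            ),
        )
        calls.append(best)
    return calls


def update_fraction(observed, calls, f_prev):
    """Fraction step: least-squares fit of f given the current base calls."""
    num = den = 0.0
    for y, (b1, b2) in zip(observed, calls):
        for yc, c in zip(y, BASES):
            a, b = float(c == b1), float(c == b2)
            num += (yc - b) * (a - b)
            den += (a - b) ** 2
    if den == 0.0:  # the templates never differ, so f cannot be identified
        return f_prev
    return min(1.0, max(0.0, num / den))


def deconvolve(observed, f0=0.6, tol=1e-9, max_iter=50):
    """Alternate the two steps until the fraction estimate stops changing."""
    f = f0
    calls = update_signal(observed, f)
    for _ in range(max_iter):
        f_new = update_fraction(observed, calls, f)
        calls = update_signal(observed, f_new)
        converged = abs(f_new - f) < tol
        f = f_new
        if converged:
            break
    reads = tuple("".join(bases) for bases in zip(*calls))
    return f, reads


# Made-up aggregate signal: templates ACGT and AGTT mixed at 0.7 : 0.3.
observed = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 0.7, 0.3, 0.0],
    [0.0, 0.0, 0.7, 0.3],
    [0.0, 0.0, 0.0, 1.0],
]
fraction, reads = deconvolve(observed)
```

The initial value of 0.6 is chosen away from 0.5 because, at equal fractions, swapping the two templates gives the same predicted signal and the assignment of bases to templates is ambiguous.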

In some embodiments, the systems are configured to detect and/or identify entire molecules. In some embodiments, the systems are configured to detect and/or identify portions of molecules. In a preferred embodiment, the systems are configured to detect and/or identify the complete nucleotide sequence of a nucleic acid or a portion of the complete nucleotide sequence of a nucleic acid, such as the nucleotide sequence of a target region of a nucleic acid.

In preferred embodiments, solid supports comprise sites at which molecules are associated and detected in aggregate. In some embodiments, when the solid support comprises one or more wells, the sites where molecules are associated can include wells. In some embodiments, the solid support is a bead and the site is the entire bead. In other embodiments, the solid support is a bead and the site is a portion or part of a bead. In some embodiments, the one or more beads are associated with a second solid support, such as wells of a microtiter or other multiwell plate, a lane or channel in a flow cell or a lane or channel in a multi-lane flow chamber.

In some embodiments of the systems described herein, the solid support is a plate, such as a picotiter plate, and the wells can further include one or more beads selected from the group consisting of beads having a nucleic acid associated therewith or attached thereto, beads having one or more enzymes associated therewith or attached thereto and blank beads or other beads having no nucleic acid or enzyme associated therewith or attached thereto. In preferred embodiments where beads having associated or attached enzymes are contemplated, the enzymes can comprise enzymes or enzyme systems useful for generating a detectable signal. For example, the signal can be colorimetric as is typical of the signal generating systems utilized with horseradish peroxidase. Other examples of enzymes or enzyme systems useful for generating signals include, but are not limited to, sulfurylase and luciferase. In some such embodiments, the enzyme system includes a sulfurylase enzyme and a luciferase enzyme associated with the same bead. In other such embodiments, the enzyme system includes a sulfurylase enzyme associated with a bead that is separate from the luciferase enzyme. In some embodiments, the well further includes beads having neither a nucleic acid nor an enzyme attached thereto.

Various numbers of molecules can be associated with the site. In some embodiments, about 1,000 to about 10,000 molecules are associated with the site. In some other embodiments, about 2,000 to about 8,000 molecules are attached to the site. In still some other embodiments, about 3,000 to about 6,000 molecules are attached to the site. In yet some other embodiments, about 4,000 to about 5,000 molecules are attached to the site.

In a preferred embodiment, the initial number of molecules associated with a site ranges from about 10 to about 1000 molecules. In another preferred embodiment, about 10 to about 500 molecules are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 molecules are initially associated with the site.

In some embodiments of the molecule detection systems described herein, a site includes about 2 to about 10¹¹ molecules, about 2 to about 10¹⁰ molecules, about 2 to about 10⁹ molecules, about 2 to about 10⁸ molecules, about 2 to about 10⁷ molecules, about 2 to about 10⁶ molecules, about 2 to about 10⁵ molecules, or about 2 to about 10⁴ molecules. In other embodiments, a site includes about 10 to about 10¹¹ molecules, about 10 to about 10¹⁰ molecules, about 10 to about 10⁹ molecules, about 10 to about 10⁸ molecules, about 10 to about 10⁷ molecules, about 10 to about 10⁶ molecules, about 10 to about 10⁵ molecules, or about 10 to about 10⁴ molecules. In still other embodiments, the site includes about 50 to about 10¹¹ molecules, about 50 to about 10¹⁰ molecules, about 50 to about 10⁹ molecules, about 50 to about 10⁸ molecules, about 50 to about 10⁷ molecules, about 50 to about 10⁶ molecules, about 50 to about 10⁵ molecules, or about 50 to about 10⁴ molecules. In yet other embodiments, a site includes about 100 to about 10¹¹ molecules, about 100 to about 10¹⁰ molecules, about 100 to about 10⁹ molecules, about 100 to about 10⁸ molecules, about 100 to about 10⁷ molecules, about 100 to about 10⁶ molecules, about 100 to about 10⁵ molecules, or about 100 to about 10⁴ molecules. In any of the above-described embodiments of the molecule detection systems described herein, the molecules present at a site can be detected in aggregate. In some embodiments, the molecules are associated with the site. In other embodiments, the molecules are attached at the site. In certain embodiments, the molecules comprise nucleic acids.

Computer Implemented Embodiments

In some embodiments, one or more steps, such as those set forth herein, are carried out by a computer. In some embodiments, a computer is used to estimate the fraction of different types of molecules associated with a site on a solid support. In some embodiments, a computer is used to estimate the amount of signal corresponding to different types of molecules associated with the site. In some embodiments, a computer is used to calculate, or otherwise obtain, a signal estimate using the fraction estimate. In some embodiments, a computer is used to calculate, or otherwise obtain, a fraction estimate using the signal estimate. In some embodiments, a computer is used to iteratively update the fraction estimate and signal estimate until the estimates converge, thereby detecting molecules associated with the site on the solid support. As will be appreciated, the fraction estimate and/or signal estimate can be obtained by a variety of mathematical methods known in the art including, but not limited to, a numerical optimization algorithm. In preferred embodiments, the numerical optimization algorithm is based on an iterative map search or Fienup's iteration map.

Exemplary computer systems, which are useful in implementing the methods and/or compositions described herein, include, but are not limited to, personal computer systems, such as those based on Intel®, IBM®, or Motorola® microprocessors; or workstations such as a SPARC workstation or UNIX workstation. Useful systems include those using the Microsoft Windows, UNIX or LINUX operating system. The systems and methods described herein can also be implemented to run on client-server systems or wide-area networks such as the internet.

A computer system useful in implementing the methods and/or compositions described herein can be configured to operate as either a client or server and can include one or more processors which are coupled to a random access memory (RAM). Implementation of embodiments of the methods and/or compositions described herein is not limited to any particular environment or device configuration. The embodiments of the present invention may be implemented in any type of computer system or processing environment capable of supporting the methodologies that are set forth herein. In particular embodiments, algorithms can be written in MATLAB, C or C++, or other computer languages known in the art.

In some embodiments described herein, a computer can be used in the acquisition and storage of data relating to the compositions and methods described herein. For example, in some embodiments, the computer can be programmed, or otherwise instructed, to provide sequence data and/or other relevant information to a user, another computer, a database or a network. In additional embodiments, the computer can also be programmed, or otherwise instructed, to receive relevant information from a user, another computer, a database or a network. Such information can include data, such as signals or images, obtained from a sequencing method, one or more reference sequences, characteristics of an organism of interest or the like.

Identification of Biosignatures and Other Applications

Methods, systems and compositions described herein are useful tools in obtaining the biosignature for a population of molecules, such as nucleic acids, associated with a site on a solid support. In some embodiments, molecules associated at a site on a solid support are detected in aggregate by detecting a signal corresponding to the aggregate of molecules. The aggregate signal can then be deconvoluted in accordance with one or more of the methods described herein in order to obtain a biosignature for a population or a subpopulation of molecules present at the site.

In some embodiments, a solid support having molecules associated therewith is provided. In such embodiments, a user then determines a biosignature for one or more populations or subpopulations of molecules associated with one or more sites on the solid support. Alternatively, in some embodiments, the user of the solid support associates molecules from one or more samples with one or more sites on the solid support. In a preferred embodiment, a population of molecules, which comprises at least two different types of molecules, and which was obtained from a sample, is associated with a site on a solid support. In some embodiments, the molecules of the sample are tagged prior to or subsequent to associating them with the solid support. In some such embodiments, the tags are utilized to identify the subject or source from which the sample was obtained. In some embodiments, the samples are obtained from a plurality of subjects or sources.

The sequencing and/or biosignature information that can be obtained using the methods described herein can be used in a variety of applications involved in, but not limited to, genotyping, expression profiling, identifying alternative splicing, genome mapping, amplicon sequencing, methylation detection, metagenomics, SNP detection, pathogen infection detection, treatment outcome prediction, pollution detection, determining disease progression state, and environmental monitoring.

In one example, the methods described herein can deconvolute the aggregate signal generated by multiple types of molecules associated with a site of a solid support, thereby permitting identification of the different types of molecules present at the site. In some embodiments, the different types of molecules include alleles of a genetic locus from a polyploid organism, including, but not limited to, a diploid organism, such as a mammal (for example, rats, mice and humans); a triploid organism, such as seedless watermelons; a tetraploid organism, such as Salmonidae fish; a pentaploid organism, such as Kenai Birch; and a hexaploid organism, such as wheat and kiwifruit. In some embodiments, the different types of molecular components include alternative splicing forms of a nucleic acid. In some embodiments, the different types of molecules include nucleic acids or proteins from variants of a particular organism, including, but not limited to, a virus (for example, HIV, HCV and HBV), a bacterium (for example, a pathogenic bacterium such as E. coli O157), and/or a eukaryotic cell. In some embodiments, the variation(s) between the different types of molecular components include a mutation, a polymorphism, an insertion, a deletion, a substitution, a simple tandem repeat polymorphism, or a single nucleotide polymorphism (SNP).

In one example, the methods described herein can provide a biosignature for populations of molecules or subpopulations of molecules in a sample. Accordingly, in some embodiments, the complete sequence of each type of molecule of the sample need not be determined and, instead, the sequence of portions of all or some of the different types of molecules of the population or subpopulation can be determined and used.

In some embodiments, the sample is an environmental sample from any environmental source. In some embodiments, the environmental sample is obtained from naturally occurring or artificial sources. In some embodiments, the sample is obtained from the atmosphere, water systems, soil or any other source of interest. In other, more particular embodiments, the environmental samples can be obtained from, for example, atmospheric pathogen collection systems, sub-surface sediments, groundwater, ancient water deep within the ground, plant root-soil interface of grassland, coastal water and sewage treatment plants.

In some embodiments, the sample can be any kind of investigational, clinical or medical sample. In some examples, samples can be obtained from a subject, such as from the blood, the lungs or the gut of mammals. In some embodiments, the sample is a biological sample obtained from a subject suspected of having, or suffering from, a certain disease. In some embodiments, the molecules comprise a marker from a pathogen. Non-limiting examples of pathogens include a virus, a bacterium, and a eukaryotic cell, such as a tumorigenic or cancer cell.

In some embodiments, the biosignature obtained for all of the molecules in a sample or a subpopulation of molecules in the sample can be compared to a reference biosignature. In some embodiments, the similarities and/or differences between the biosignature obtained for the sample and the reference biosignature can be used to identify an organism or pathogen, predict a treatment outcome for a patient, estimate disease progression state, or estimate the extent of pollution in a certain environment.

Bead Mixtures

Some embodiments of the compositions described herein relate to mixtures of beads. In some embodiments, the mixtures of beads comprise a plurality of beads, wherein each bead of the plurality of beads comprises a first subpopulation and a second subpopulation of nucleic acids. In certain embodiments, the first subpopulation and the second subpopulation of nucleic acids are associated with the bead such that they are detected in aggregate. In preferred embodiments, the nucleic acids of the first subpopulation each comprise an identical target region and the nucleic acids of the second subpopulation each comprise an identical region that is a variant of the target region.

In some embodiments of the above-described mixture of beads, the nucleic acids are attached to each bead of the plurality of beads.

In some embodiments, the mixture of beads is distributed on a substrate. In other embodiments, a plurality of beads that includes beads comprising both the first subpopulation and the second subpopulation of nucleic acids is distributed on a substrate. In some embodiments, the distribution of beads on the substrate is a random distribution. In other embodiments, the substrate includes wells and the beads are distributed in the wells. In yet other embodiments, wells of the substrate further include beads having an enzyme attached thereto. In preferred embodiments, the enzyme can include sulfurylase, luciferase or a combination of sulfurylase and luciferase. In some embodiments, wells of the substrate further include beads having neither a nucleic acid nor enzyme attached thereto.

Solid Supports Having a Reduced Number of Molecules Associated Therewith

Some embodiments described herein relate to beads and/or other solid supports that are useful for attaching mixtures of molecules at a reduced number. In certain preferred embodiments of the methods and systems described herein, solid supports having a reduced number of molecules associated therewith are desirable for certain statistical analyses. The embodiments below describe compositions having molecules associated therewith at reduced numbers. Additionally, various methods for associating molecules with a site on a solid support at reduced numbers are described.

In preferred embodiments of the above-described solid support, the initial number of molecules associated with a site ranges from about 10 to about 1000 molecules. In another preferred embodiment, about 10 to about 500 molecules are initially associated with the site. In yet another preferred embodiment, about 10 to about 100 molecules are initially associated with the site.

In some embodiments where the molecules are nucleic acids, the solid support further comprises one or more capture probes and/or one or more primers complementary to a portion of one or more of the molecules associated with the site on the solid support.

In some embodiments, the number of capture probes associated with the site on the solid support can be reduced by reducing the concentration of capture probes used in the association process. Although this method produces solid supports with sites that have a low or considerably reduced number of capture probes associated therewith, the number of sites having no capture probes is also increased.

In other embodiments, the number of capture probes associated with the site on the solid support can be reduced by introducing competitor molecules, such as complementary oligonucleotides. The complementary oligonucleotides are hybridized with capture probes at the site such that at least about 10%, at least about 20%, at least about 30%, at least about 40%, at least about 50%, at least about 60%, at least about 70%, at least about 80%, at least about 90%, at least about 95%, at least about 99%, or at least about 99.9% of the capture probes are hybridized with competitor oligonucleotides. Subsequently, nucleic acids of interest can be hybridized with the reduced number of capture probes so as to produce a solid support having a reduced number of nucleic acids of interest associated therewith.

In some embodiments, the above methods are applied to beads. In some such embodiments, the resulting bead composition comprises a first subpopulation of capture nucleic acids having a competitor molecule hybridized thereto and a second subpopulation of capture nucleic acids comprising a region that permits hybridization of a complementary molecule.

In still other embodiments, the number of capture probes associated with the site on the solid support can be reduced by first amplifying nucleic acids of interest to produce a population of amplified nucleic acids that include a region intended to be complementary to a capture probe (a tag region), wherein the tag region is degenerate among nucleic acids within the population. As such, only a portion of the amplified population of nucleic acids will hybridize with capture probes at the site on the solid support, thereby producing sites having a reduced number of nucleic acids associated therewith. The number of nucleic acids that hybridize with capture probes can be adjusted up or down by adjusting the amount of degeneracy introduced into the tag region and/or adjusting the stringency of hybridization.

In some embodiments, the above methods are applied to beads. In some such embodiments, the resulting bead composition comprises capture nucleic acids hybridized with an amplified nucleic acid comprising a degenerate tag. In some such embodiments, the degenerate tag is hybridized to a capture nucleic acid.

In some embodiments of the above-described bead compositions, beads are distributed, or otherwise present, in a channel of a substrate. In other embodiments, the bead is present in a well of a multiwell substrate. In a preferred embodiment, the well is configured to hold a single bead having the amplified nucleic acid hybridized thereto. In even more preferred embodiments, the well comprises a single bead having the amplified nucleic acid hybridized thereto as well as a plurality of beads that are smaller in size than the single bead. In especially preferred embodiments, the wells are configured to accommodate a plurality of beads that are smaller in size than the single bead. In even more preferred embodiments, a portion of the small bead population in the well comprises beads having one or more enzymes attached. In even more preferred embodiments, a portion of the small bead population in the well comprises blank beads or beads having neither a nucleic acid nor protein attached thereto.

In some embodiments of the above-described solid supports, the population of nucleic acids includes alleles of a genetic locus from a polyploid organism. In other embodiments, the population of nucleic acids includes alternative splicing forms of a nucleic acid. In yet other embodiments, the population of nucleic acids includes alleles of a genetic locus from a diploid organism.

In any of the above-described embodiments of beads, mixtures of beads and/or other solid supports described herein that comprise nucleic acids, a population of nucleic acids or a subpopulation of nucleic acids present on the bead can be detected in aggregate. In some embodiments, the nucleic acids are associated with the bead. In other embodiments, the nucleic acids are attached at the bead.

EXAMPLES

Embodiments of the present disclosure are disclosed in further detail in the following examples, which are not in any way intended to limit the present disclosure.

Example 1 Detecting Molecules in Aggregate

In some embodiments, two different types of molecules, for example allele 1 and allele 2, associated with a site on a solid support are detected in aggregate. The aggregate signals observed, i.e., the observations (O), are generated from a mixture of allele 1 and allele 2. As such, the signal observed is the combined signal from a mixture of types of molecules (templates) t₁ and t₂, that is, the weighted sum of the signals from the pure templates. The observations (O) are arranged in a column matrix with the notation O=tf, where “t” represents the template matrix, which has non-negative integer entries reflecting the amount of signal generated from each type of molecule, and “f” represents the fraction matrix corresponding to the template matrix, which is constrained such that each of its columns sums to 1.

With observations O, values for t and f can be calculated by first guessing the fraction matrix and using the variation associated with that guess to calculate an estimate for the template matrix using a least squares estimation as follows:

$O = t f$

$t_{new} = O f^{T} (f f^{T})^{-1}$

Subsequent to calculating the estimate, elements of t are set to zero if negative and rounded to integers if not negative.
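The template update described above can be sketched in a few lines of Python. This is a minimal illustration only, assuming NumPy is available; the function name `estimate_templates`, the example matrices, and the use of a pseudo-inverse in place of an explicit inverse (a numerical convenience) are assumptions made here and are not part of the disclosure.

```python
import numpy as np

def estimate_templates(O, f):
    """Least squares template estimate t_new = O f^T (f f^T)^-1.

    O: (flows x observations) matrix of aggregate signals.
    f: (types x observations) fraction guess whose columns sum to 1.
    """
    O = np.asarray(O, dtype=float)
    f = np.asarray(f, dtype=float)
    # Assumption: pinv stands in for the explicit inverse of f f^T.
    t_new = O @ f.T @ np.linalg.pinv(f @ f.T)
    # Negative entries are set to zero; the rest are rounded to integers.
    return np.rint(np.clip(t_new, 0.0, None)).astype(int)

# Hypothetical data: 4 flows, 2 molecule types, 3 observations.
t_true = np.array([[2, 0], [1, 1], [0, 3], [1, 0]])
f_true = np.array([[0.2, 0.5, 0.9], [0.8, 0.5, 0.1]])
O = t_true @ f_true
t_est = estimate_templates(O, f_true)
```

In this noiseless example the fraction guess is exact, so the estimate reproduces the hypothetical template matrix; with an imperfect guess the result is only a starting point for the later updating steps.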

Alternatively, the methods described herein can be used to first estimate fractions starting from a template matrix. Such a process is similar to that indicated above, in that the template matrix is first guessed. The variation associated with that guess is used to calculate an estimate for the fraction matrix using the following least squares estimation:

$O = t f$

$f_{new} = (t^{T} t)^{-1} t^{T} O$

Subsequent to calculating the estimate, elements of f are set to zero if negative, and each column is divided by its sum.
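The fraction update can be sketched in the same way. The following Python snippet is an illustration under stated assumptions: NumPy is assumed, the name `estimate_fractions` and the example matrices are hypothetical, and the fallback to equal fractions for a column that clips to all zeros is a choice made here that the text does not address.

```python
import numpy as np

def estimate_fractions(O, t):
    """Least squares fraction estimate f_new = (t^T t)^-1 t^T O.

    O: (flows x observations) matrix of aggregate signals.
    t: (flows x types) template guess.
    """
    O = np.asarray(O, dtype=float)
    t = np.asarray(t, dtype=float)
    # Assumption: pinv stands in for the explicit inverse of t^T t.
    f_new = np.linalg.pinv(t.T @ t) @ t.T @ O
    # Negative entries are set to zero.
    f_new = np.clip(f_new, 0.0, None)
    # Each column is divided by its sum.
    col_sums = f_new.sum(axis=0, keepdims=True)
    safe_sums = np.where(col_sums > 0, col_sums, 1.0)
    # Assumption: an all-zero column falls back to equal fractions.
    return np.where(col_sums > 0, f_new / safe_sums, 1.0 / f_new.shape[0])

# Hypothetical data: 4 flows, 2 molecule types, 3 observations.
t_true = np.array([[2, 0], [1, 1], [0, 3], [1, 0]])
f_true = np.array([[0.2, 0.5, 0.9], [0.8, 0.5, 0.1]])
O = t_true @ f_true
f_est = estimate_fractions(O, t_true)
```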

Now that starting estimates have been obtained for each matrix, these estimates can be updated until convergence on a solution set is reached. Here, a Fienup algorithm is used to implement the updating steps using a personal computer as follows:

Let $x = (t, f)$, with the convention that $P_{A}(x)$ updates only the template t and $P_{B}(x)$ updates only the fractions f. The Fienup iteration is defined as:

$x \mapsto x + P_{A}(2P_{B}(x) - x) - P_{B}(x)$

The update is iterated until convergence, and a solution set for the template and fraction matrices is obtained.
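One possible reading of this iteration is sketched below in Python. It is not presented as the implementation used in the examples. It assumes NumPy, takes P_A and P_B to be the two least squares updates described earlier (with a pseudo-inverse used for numerical convenience), and uses hypothetical names, starting guesses, tolerance and iteration cap. Because P_B leaves t unchanged and P_A leaves f unchanged, the map reduces componentwise to the two assignments marked in the comments.

```python
import numpy as np

def project_templates(O, f):
    """P_A: update only the template matrix from a fraction matrix."""
    t = O @ f.T @ np.linalg.pinv(f @ f.T)
    return np.rint(np.clip(t, 0.0, None))

def project_fractions(O, t):
    """P_B: update only the fraction matrix from a template matrix."""
    f = np.clip(np.linalg.pinv(t.T @ t) @ t.T @ O, 0.0, None)
    s = f.sum(axis=0, keepdims=True)
    # Assumption: an all-zero column falls back to equal fractions.
    return np.where(s > 0, f / np.where(s > 0, s, 1.0), 1.0 / f.shape[0])

def fienup_iterate(O, t0, f0, max_iter=300, tol=1e-9):
    """Iterate x -> x + P_A(2 P_B(x) - x) - P_B(x) for x = (t, f)."""
    O = np.asarray(O, dtype=float)
    t, f = np.asarray(t0, dtype=float), np.asarray(f0, dtype=float)
    for n_iter in range(1, max_iter + 1):
        f_b = project_fractions(O, t)        # P_B(x) = (t, f_b)
        f_r = 2.0 * f_b - f                  # 2 P_B(x) - x = (t, f_r)
        t_a = project_templates(O, f_r)      # P_A(2 P_B(x) - x) = (t_a, f_r)
        # Template part: t + t_a - t = t_a. Fraction part: f + f_r - f_b = f_b.
        change = max(np.abs(t_a - t).max(), np.abs(f_b - f).max())
        t, f = t_a, f_b
        if change < tol:
            break
    return t, f, n_iter

# Hypothetical noiseless observations and starting guesses.
t_true = np.array([[2, 0], [1, 1], [0, 3], [1, 0]])
f_true = np.array([[0.2, 0.5, 0.9], [0.8, 0.5, 0.1]])
O = t_true @ f_true
t0 = np.array([[1, 0], [1, 1], [0, 1], [1, 0]])
f0 = np.full((2, 3), 0.5)
t_est, f_est, n_iter = fienup_iterate(O, t0, f0)
```

The sketch guarantees only that the returned template entries are non-negative integers and that each fraction column sums to 1. Whether and how quickly the iterates approach the true matrices depends on the data and the starting guess.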

Example 2 Detecting Molecules in Aggregate in Single Flow Systems

In some embodiments, only one flow varies and only one signal (O) is observed. In such cases, matrix methods are not employed. In such situations, O can be modeled as:

$O = I L_{A} X + I L_{B} (N - X)$, where

$L_{A}$ is the length for allele A

$L_{B}$ is the length for allele B

I is the intensity per molecule

N is the number of molecules associated with the site, and

X is the number of molecules of allele A on the site

Under this model, an estimate of O/I (the observation divided by the intensity factor) and the variance of that estimate can be set forth as follows:

$E\left[\frac{O}{I}\right] = L_{A} N p + L_{B} N (1 - p)$

$\mathrm{Var}\left[\frac{O}{I}\right] = N p (1 - p) (L_{A} - L_{B})^{2}$

where p is the probability that a molecule at the site includes allele A.

Here, the variance of the observation depends on the difference between allele nucleotide run lengths, which allows inference of the combination of alleles by statistical analysis of the observation for the site.
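As a quick numerical check of the single-flow model, the short Python sketch below evaluates the closed-form mean and variance of O/I and compares them with a direct enumeration. It assumes that X is binomially distributed with N trials and success probability p, which is the assumption under which these expressions hold; the function names and the example values are hypothetical.

```python
from math import comb

def single_flow_moments(L_A, L_B, N, p):
    """Closed-form mean and variance of O/I = L_A*X + L_B*(N - X)."""
    mean = L_A * N * p + L_B * N * (1 - p)
    var = N * p * (1 - p) * (L_A - L_B) ** 2
    return mean, var

def enumerated_moments(L_A, L_B, N, p):
    """Same moments by summing over X ~ Binomial(N, p) (assumed)."""
    pmf = [comb(N, x) * p**x * (1 - p) ** (N - x) for x in range(N + 1)]
    vals = [L_A * x + L_B * (N - x) for x in range(N + 1)]
    mean = sum(w * v for w, v in zip(pmf, vals))
    var = sum(w * (v - mean) ** 2 for w, v in zip(pmf, vals))
    return mean, var

# Hypothetical values: run lengths 3 and 1, 100 molecules, p = 0.5.
mean, var = single_flow_moments(L_A=3, L_B=1, N=100, p=0.5)
mean_enum, var_enum = enumerated_moments(L_A=3, L_B=1, N=100, p=0.5)
```

For these values the mean is 200 and the variance is 100. When the two run lengths are equal, the variance is zero, which is why the variance carries information about the allele combination.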

Example 3 Computer-Simulations

The following simulations were run to demonstrate that the methods described herein were robust enough to deconvolve mixed signals produced by sets of molecules having different types of variant regions. In order to generate molecule pairs appropriate for such simulations, a first nucleic acid sequence was generated at random. Next, a mutation type was chosen at random from one of the following mutation types: (1) nucleotide substitution; (2) nucleotide insertion; (3) nucleotide deletion; and (4) random nucleotide sequence.

In cases where a nucleotide substitution was selected, a small random number was generated and was used to determine the number of bases replaced with different bases to generate a second sequence from the first. The second sequence was employed as a nucleotide substitution variant of the first nucleotide sequence.

In cases where a nucleotide deletion was selected, a short nucleotide sequence was deleted from the first sequence, and then the same number of random nucleotides were appended to the sequence in order to generate a second sequence having the same length as the first sequence. The second sequence was employed as a nucleotide sequence deletion variant of the first nucleotide sequence.

In cases where a nucleotide insertion was selected, a short nucleotide sequence was introduced into the first sequence, and then the same number of nucleotides were removed from the sequence in order to generate a second sequence having the same length as the first sequence. The second sequence was employed as a nucleotide sequence insertion variant of the first nucleotide sequence.

In cases where the second sequence was random as compared to the first sequence, the second sequence was generated at random, independently from the first.
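The four ways of generating a sequence pair can be sketched as follows in Python. This is an illustration only and not the simulation code that was run. All function names, the limit of five on the size of a change, trimming from the end of the sequence after an insertion, and giving the random second sequence the same length as the first are assumptions introduced here.

```python
import random

BASES = "ACGT"

def random_sequence(length, rng):
    return "".join(rng.choice(BASES) for _ in range(length))

def substitution_variant(seq, rng, max_changes=5):
    """Replace a small random number of bases with different bases."""
    bases = list(seq)
    n_changes = rng.randint(1, max_changes)
    for pos in rng.sample(range(len(bases)), n_changes):
        bases[pos] = rng.choice([b for b in BASES if b != bases[pos]])
    return "".join(bases)

def deletion_variant(seq, rng, max_len=5):
    """Delete a short stretch, then append random bases to restore length."""
    k = rng.randint(1, max_len)
    start = rng.randrange(len(seq) - k + 1)
    return seq[:start] + seq[start + k:] + random_sequence(k, rng)

def insertion_variant(seq, rng, max_len=5):
    """Insert a short stretch, then trim the end to restore length."""
    k = rng.randint(1, max_len)
    start = rng.randrange(len(seq) - k + 1)
    longer = seq[:start] + random_sequence(k, rng) + seq[start:]
    return longer[:len(seq)]  # assumption: bases are removed from the end

def make_pair(rng, min_len=75, max_len=200):
    """Generate a first sequence and a variant of a randomly chosen type."""
    first = random_sequence(rng.randint(min_len, max_len), rng)
    kind = rng.choice(["substitution", "insertion", "deletion", "random"])
    if kind == "substitution":
        second = substitution_variant(first, rng)
    elif kind == "insertion":
        second = insertion_variant(first, rng)
    elif kind == "deletion":
        second = deletion_variant(first, rng)
    else:
        second = random_sequence(len(first), rng)  # independent of the first
    return kind, first, second

rng = random.Random(0)  # fixed seed so the sketch is repeatable
pairs = [make_pair(rng) for _ in range(20)]
```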

Multiple sequence pairs were generated for each of the four categories above. The sequences in each pair were from approximately 75 to approximately 200 nucleotides in length. For each sequence pair, sequencing flows were simulated and several thousand random proportions were chosen according to the binomial distribution. A number of simulated mixed flows were generated for various proportions of the two sequences, forming the observation matrix. This observation matrix was used as input to the algorithms described in Examples 1 or 2 depending on the type of sequencing run that was simulated. The optimization was run and the resulting output sequences were compared to the input sequences for the final evaluation.

In nearly all of the simulations, convergence was reached in fewer than 300 iterations. The output sequences closely matched the input sequences, having an error rate on the order of 10⁻¹¹.
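A simulated mixed flow of the kind described above might be generated as in the following Python sketch. Several details are assumptions made for illustration and are not stated in the text: flows are modeled as homopolymer run lengths read out under a repeating nucleotide order (here `TACG`), the proportion is obtained as X/N from a binomial draw X over N molecules, and no noise is added. The function names are hypothetical.

```python
import random

def flowgram(seq, n_flows, flow_order="TACG"):
    """Homopolymer run length incorporated at each nucleotide flow."""
    signal, pos = [], 0
    for i in range(n_flows):
        base = flow_order[i % len(flow_order)]
        run = 0
        # Extend through every consecutive base matching the current flow.
        while pos < len(seq) and seq[pos] == base:
            run += 1
            pos += 1
        signal.append(run)
    return signal

def mixed_flows(seq_a, seq_b, n_flows, n_molecules, p, rng):
    """One column of an observation matrix for a two-sequence mixture."""
    # Binomial draw: number of molecules at the site carrying seq_a.
    x = sum(rng.random() < p for _ in range(n_molecules))
    frac_a = x / n_molecules
    flows_a = flowgram(seq_a, n_flows)
    flows_b = flowgram(seq_b, n_flows)
    mixed = [frac_a * a + (1 - frac_a) * b for a, b in zip(flows_a, flows_b)]
    return mixed, frac_a

rng = random.Random(0)  # fixed seed so the sketch is repeatable
flows_a = flowgram("TTAC", 4)
flows_b = flowgram("TTAG", 4)
obs, frac_a = mixed_flows("TTAC", "TTAG", 4, 100, 0.3, rng)
```

Here `flows_a` is `[2, 1, 1, 0]` and `flows_b` is `[2, 1, 0, 1]`, so the two sequences contribute identically to the first two flows and differ only in the last two. Repeating the draw for many proportions and stacking the results column by column would give an observation matrix in the sense of Example 1.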

Example 4 Detection of HIV Variants in a Patient

DNA is extracted from a blood sample obtained from an HIV patient. An HIV marker gene, such as gp41 or p24, is amplified from the HIV patient's DNA and attached to beads for sequencing. The methods described herein are used to detect whether there are at least two different variants of the HIV marker gene present in the patient's DNA sample. The presence of genetic variation(s) in the HIV marker gene can be used to indicate the progression of HIV infection in the patient, determine the effectiveness of prior HIV treatment, and/or improve the prognosis for the patient.

Example 5 Quality Control of Wastewater Treatment Plant

DNA is extracted from a water sample obtained from a wastewater treatment plant. Primers specific for amplifying the 16S rRNA genes from bacteria that are commonly found in human feces and sewage samples are used to amplify 16S rRNA genes from the DNA obtained from the water sample, and the amplified genes are attached to beads for sequencing. The methods described herein are used to detect the presence of one or more bacteria of interest, such as the presence of different types of fecal coliforms. The presence of any bacteria of interest is an indication of fecal contamination at the wastewater treatment plant.

The above description discloses several methods and systems of the present invention. This invention is susceptible to modifications in the methods and materials, as well as alterations in the fabrication methods and equipment. Such modifications will become apparent to those skilled in the art from a consideration of this disclosure or practice of the invention disclosed herein. Consequently, it is not intended that this invention be limited to the specific embodiments disclosed herein, but that it cover all modifications and alternatives coming within the true scope and spirit of the invention.

All references cited herein including, but not limited to, published and unpublished applications, patents, and literature references, are incorporated herein by reference in their entirety and are hereby made a part of this specification. To the extent publications and patents or patent applications incorporated by reference contradict the disclosure contained in the specification, the specification is intended to supersede and/or take precedence over any such contradictory material.

The term “comprising” as used herein is synonymous with “including,” “containing,” or “characterized by,” and is inclusive or open-ended and does not exclude additional, unrecited elements or method steps.

What is claimed is:
 1. A method of detecting molecules, said method comprising the steps of: (a) providing a solid support comprising molecules associated with a site on the solid support such that the molecules are detected in aggregate during a detection step, wherein the site comprises at least two different types of molecules; (b) detecting a signal corresponding to the aggregate of molecules at the site; (c) estimating the fraction of different types of molecules at the site or estimating the amount of signal corresponding to different types of molecules at the site; (d) calculating the amount of signal corresponding to different types of molecules at the site using the fraction estimate, thereby obtaining a signal estimate or calculating the fraction of different types of molecules at the site using the signal estimate, thereby obtaining a fraction estimate; and (e) iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby detecting molecules associated with the site.
 2. The method of claim 1, wherein the providing step further comprises attaching the molecules at the site.
 3. The method of claim 1, wherein the estimating step is performed by guessing the fraction of different types of molecules at the site or guessing the amount of signal corresponding to different types of molecules at the site.
 4. The method of claim 1, wherein the providing step further comprises associating the molecules with the site.
 5. The method of claim 1, wherein the estimating step comprises performing a principal component analysis (PCA).
 6. The method of claim 1, wherein the updating step comprises performing a numerical optimization algorithm.
 7. The method of claim 6, wherein the numerical optimization algorithm is based on an iterative map search.
 8. The method of claim 7, wherein the numerical optimization algorithm is based on Fienup's iteration map.
 9. The method of claim 1, wherein sequence data is obtained for one or more molecules.
 10. The method of claim 9, wherein sequence data is obtained by a sequencing-by-synthesis process.
 11. The method of claim 1, wherein about 1,000 to about 10,000 molecules are associated with the site.
 12. The method of claim 1, wherein the molecules comprise nucleic acids.
 13. The method of claim 12, wherein the nucleic acids comprise a first subpopulation of nucleic acids and a second subpopulation of nucleic acids, said nucleic acids of the first subpopulation each having an identical target region and the nucleic acids of the second subpopulation each having an identical region that is a variant of the target region.
 14. The method of claim 12, wherein the nucleotide sequence of said target region of the nucleic acids of the first subpopulation has at least 3 nucleotides that are different as compared to the nucleotide sequence of said variant of the target region of the nucleic acids of the second subpopulation.
 15. The method of claim 12, wherein the nucleic acids comprise alleles of a genetic locus from a polyploid organism.
 16. The method of claim 12, wherein the nucleic acids comprise alternative splicing forms of a nucleic acid.
 17. The method of claim 12, wherein the nucleic acids comprise alleles of a genetic locus from a diploid organism.
 18. A molecule detection system comprising: a solid support comprising molecules associated with a site on the solid support such that the molecules are detected in aggregate, wherein the molecules comprise at least two different types of molecules; and a detector configured to detect said molecules associated with said site.
 19. A method of identifying a target region of a nucleic acid, said method comprising: (a) associating a first subpopulation of nucleic acids with a site on a solid support, wherein nucleic acids of said first subpopulation comprise an identical target region; (b) associating a second subpopulation of nucleic acids with the site on the solid support, wherein nucleic acids of said second subpopulation comprise an identical target region that is a variant of the target region of the nucleic acids of said first subpopulation; (c) detecting a signal corresponding to one or more nucleotides of the target region of first subpopulation nucleic acids and one or more nucleotides of the variant of the target region of second subpopulation nucleic acids; (d) estimating the fraction of first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site or estimating the amount of signal corresponding to first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site; (e) calculating the amount of signal corresponding to first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site using the fraction estimate, or calculating the fraction of first subpopulation nucleic acids and second subpopulation nucleic acids associated with the site using the signal estimate; and (f) iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby identifying a target region of a nucleic acid.
 20. A method for identifying a biosignature, said method comprising the steps of: (a) providing samples obtained from a plurality of subjects, wherein the samples comprise molecules; (b) tagging molecules from the samples so as to identify the subject from which each sample originated; (c) associating molecules from the samples with a site on a solid support such that the molecules are detected in aggregate during a detection step, wherein the site comprises at least two different types of molecules; (d) obtaining a biosignature for molecules associated with the site by: i) detecting a signal corresponding to the aggregate of the molecules at the site; ii) estimating the fraction of different types of molecules at the site or the amount of signal corresponding to different types of molecules at the site; iii) calculating the amount of signal corresponding to different types of molecules at the site using the fraction estimate, or calculating the fraction of different types of molecules at the site using the signal estimate; and iv) iteratively updating the fraction estimate and signal estimate until the estimates converge, thereby obtaining a biosignature for molecules at the site; and (e) comparing the biosignature obtained in step (d) to a reference biosignature, thereby identifying said biosignature.